READER BEWARE: This got long and geeky, so make sure you're really trying to avoid doing "real work" before you sit down to read.

Welcome back all, and thanks for joining me for the final installment of a three-part series on decompiling Java code and analyzing stack traces. If you're interested in the backstory, you can read about decompiling Java here, and analyzing basic stack traces here.

When we last left off, our hero was frozen in a block of Carbonite and left to his doom. Err...sorry, wrong story, let me start again.

When we last left off, we walked through analyzing a simple stack trace to run down a bug in a standalone Java program. This is all well and good, but what happens when you're running a Java application server (like Tomcat or WebLogic) that has hundreds of concurrent threads running tens of different web apps? And you're just trying to figure out why your particular web application is hung?

Well, if you're lucky, the developer of your web application was a good boy/girl, and they're logging stack traces for you in a log file somewhere. If this is the case, then you can open up the log and analyze the trace like we did the last go round.

Oftentimes though, you're not so lucky, and you have to dig a little deeper to figure out what's going on. Say, for instance, your web application just starts responding very slowly. You see from access logs that responses are being served back to users, but they're about 5 times slower than normal...WTF? Or, what if your application server starts pegging the box at 100% CPU...what to do? Or, out of nowhere, your app server starts throwing java.lang.OutOfMemoryError...Gah!!! Sadly, these nebulous problems seem to happen in a production environment more often than most of us would like to admit. And when they do occur, it's usually a high stress situation because there's probably a production outage and nobody really knows why. So, how do we find a better fix to the problem than the traditional, "Let's just bounce it and see what happens" response? Why, we use "The Force", of course. Except in this case, "The Force" is just a set of debugging tips that I'm getting ready to share with you as follows:

  1. Take a deep breath and don't freak out when a bunch of people start yelling.
  2. Understand the severity of the situation. Figure out how much downtime you can tolerate for debugging before things really hit the fan.
  3. Set expectations. Let people know when they can expect the issue to be fixed. If you know a bounce will temporarily fix the problem, then set a drop-dead time for debugging and schedule a bounce. Let folks know that if you don't find the root cause of the problem, they should expect to see the issue again.
  4. Gather as much information about the problem as you can. What, specifically, are users experiencing? Can you reproduce? What are the symptoms? What log messages do you have? What does the server environment look like? etc.
  5. Try to relate this issue to something you've seen in the past. Does this look like a problem you saw last week or last month? What did you do to fix it then? Why is it popping up again now?
  6. Eliminate possible external causes. Is this actually a network problem in disguise? Is the database acting up? Is some other process on the server eating up CPU? Is the server constantly swapping because it doesn't have enough RAM to handle all its business?
  7. Eliminate environment changes as possible causes. Has the server environment changed recently? Could this be causing your problem?
  8. Look in earnest at your application server process. Are we bumping up against the Java max. heap size? Are we seeing lots of garbage collection? Is our thread pool exhausted? Is our JDBC pool maxed out?
  9. Take a look inside of the application server JVM. Generate a thread dump to get a snapshot of how your application server is behaving. Analyze the stack traces in the thread dump, and look for points of interest, like...race conditions, stuck threads, deadlock, etc.
  10. Open a ticket with the software vendor (If applicable)
  11. Pray the issue fixes itself and doesn't come back
  12. Find a new job so you don't have to deal with these problems anymore
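
If you'd like to see what steps 8 and 9 look like from inside the JVM, here's a minimal sketch using the standard java.lang.management API (Java 8 or later assumed for the syntax). The class name ThreadTriageSketch and its method names are made up for illustration. It tallies live threads by state and asks the JVM whether it can see any deadlocked threads. In a real outage you'd more likely grab a full thread dump from outside the process (kill -3 on the PID, or jstack), so treat this as a way to get a feel for the data rather than a production tool.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper class, not part of any app server or vendor API.
class ThreadTriageSketch {

    /** Counts live threads by state (RUNNABLE, BLOCKED, WAITING, ...). */
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info == null) {
                continue; // the thread exited between the two calls
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns how many threads the JVM currently reports as deadlocked. */
    static int countDeadlockedThreads() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length; // null means no deadlock was found
    }

    public static void main(String[] args) {
        System.out.println("Threads by state: " + countByState());
        System.out.println("Deadlocked threads: " + countDeadlockedThreads());
    }
}
```

A pile of BLOCKED threads, or a non-zero deadlock count, is a hint about where to start reading the full dump.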

Read on for more fascinating details about The Stack Trace...

AJAX Refresher

It's been a while since we touched on AJAX, but a question came up recently about it and I thought it might be good to review. AJAX, or "Asynchronous JavaScript and XML", is a way for portlet developers to create rich web applications that don't require the entire browser page to refresh to update content.  This is done by making asynchronous calls to the server and updating content within the page itself.  With the AquaLogic portal, this means that portlets can dynamically update content in <div> tags by requesting new content without having to refresh the entire page (and other portlets on the page).  It's a pretty simple concept; in many cases you can accomplish this without having to even change any code - you can just specify "inline refresh" on the Web Service and the portal will automatically rewrite the HTML links on the page to make AJAX calls:

inline_refresh.jpg

The HTML rewrites cause the browser to make the HTTP request "behind the scenes", and when a response comes back, the portal refreshes the content inside the portlet <div> tag.

But there are some things to know about this AJAX stuff, so here are a couple of refresher points about AJAX:

1) The response to an AJAX request is basically just a text string to a browser, and it's up to your JavaScript to interpret it.  Often you do something like:

document.getElementById("responseDivTag").innerHTML = response.getResponse();

... to refresh content.  But note that this doesn't tell the browser to "process" the response - specifically, JavaScript that comes back in the response won't run, because all we're doing is setting the HTML to a string that comes back from the server.  In order to run JavaScript in the response, you should look into the JavaScript "eval()" function, which will take a string returned from the server and run it as JavaScript.  Just make sure you don't include the <script> tags in your response if you really are returning JavaScript and are parsing it as such.

2) The response does not have to actually be HTML!  It's just a string to the browser, and you can do anything with it.  The most common use (which all of our products use) is to return JSON, or "JavaScript Object Notation", which can then be treated as objects that your script can handle however you want.  Let's say you just want to know if there was a success or failure: you could literally just return a "0" or "1" in your response and write something like:

// "response" is a stand-in for whatever object your AJAX library hands back
if (response.getResponseText() === "1") {
   alert("success!");
} else {
   alert("fail");
}

Obviously, this just barely scratches the surface on AJAX, and you can rest assured that you haven't heard the last of it.  AJAX is the cornerstone of pretty much all future Oracle portal technologies, and if you're a web developer who's not all that familiar with it, trust me:  you will be soon.

The Stack Trace Strikes Back

Howdy all. Welcome to part two of three of what was originally conceived as a one part series. It's entirely possible that I'll get all George Lucas on you years from now and produce some more of these posts that are a complete letdown and affront to your childhood memories, but I digress. For now, rest assured that this post will knock your socks off as a follow-up to my last tidbit on decompiling Java code.

Without further ado, I give you...Stack Wars II: The Stacktrace Strikes Back (I'm completely aware that I'm abusing the metaphor here, but isn't that really what blogging is all about?).

Standard disclaimer: This post is intended for a technical audience with a focus on production support. Also, everything here is Java focused, but you can certainly apply some of the concepts in a .NET environment as well...you'll just have to create your own screencaps to replace the examples I've included below.

So, what is a stack trace, and why should you care? Well, one question at a time please.

What is a stack trace?

Wikipedia says a stack trace is, "A report of the active stack frames instantiated by the execution of a program." Now, I vaguely understand the Wikipedia definition, but I also have a computer science degree from a second tier state university, so let me try to translate for those of you who were smart enough to get degrees in something besides CompSci: a stack trace is a snapshot of a program's behavior at a point in time. In the Java world, a stack trace will tell you which method was being executed at the time the trace was generated, along with its complete call stack, and usually line numbers as well. Take a look at the following simple stack trace as an example:

stack_trace.png
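
If the screencap doesn't do it for you, here's a small sketch you can run yourself. Everything in it (the StackTraceSketch class, the startUp/loadConfig/openFile methods) is invented for illustration. It throws an exception three calls deep, prints the trace, and then pulls the top frame out programmatically so you can see that a trace is really just an ordered list of "who called whom", innermost call first.

```java
// Hypothetical example class; the method names are made up for illustration.
class StackTraceSketch {

    static void startUp() {
        loadConfig();
    }

    static void loadConfig() {
        openFile(null);
    }

    static void openFile(String path) {
        if (path == null) {
            throw new IllegalArgumentException("path must not be null");
        }
    }

    /** Returns "ClassName.methodName" for the frame where the throwable was created. */
    static String describeTopFrame(Throwable t) {
        StackTraceElement top = t.getStackTrace()[0];
        return top.getClassName() + "." + top.getMethodName();
    }

    public static void main(String[] args) {
        try {
            startUp();
        } catch (IllegalArgumentException e) {
            e.printStackTrace(); // frames are listed innermost call first
            System.out.println("Thrown in: " + describeTopFrame(e));
        }
    }
}
```

Reading the printed trace from the top down walks you from the line that blew up (openFile) back out through loadConfig and startUp to main.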

Why should you care?

Good question. I'd venture a guess and say that about 99.99% of the world doesn't need to know or care about stack traces. But here you are reading this post nonetheless, so here's why they're important:

1) Good programmers almost always print stack traces out in log files when an error in a program occurs. This gives us a useful tool to track down bugs. Whether you're just reporting information to a support team somewhere, or getting a little sassy and trying to fix a problem yourself, the stack trace is like a map for finding treasure buried deep in code. Except that instead of finding actual treasure, you're just finding a logic error. And instead of getting rich, you just get to complain about a problem, and maybe fix it.

2) You can tell the JVM to generate a stack trace for a running process. Doing so allows us to take a snapshot of the JVM at an arbitrary point in time, and see what all its threads are up to. This is useful when trying to figure out why a process (Tomcat for instance) is zombied (i.e. it's running, but not responding to requests), or when you're trying to fix deadlock issues, which are particularly difficult to run down.
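
To make point 2 a bit more concrete, here's a rough sketch of the same idea done from inside a program, using Thread.getAllStackTraces() from the standard library. The class name AllStacksSketch and the output layout are my own invention and only loosely mimic what a real JVM thread dump looks like. It prints each live thread's name and state followed by its call stack.

```java
import java.util.Map;

// Hypothetical helper class; the output format is an approximation, not the JVM's own.
class AllStacksSketch {

    /** Builds a text snapshot of every live thread and its current call stack. */
    static String snapshot() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append("\" state=")
               .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(snapshot());
    }
}
```

Even in a tiny program you'll see more threads than you started yourself, which gives you a taste of what a dump from a busy app server looks like.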

Hit the jump to learn more about reading and interpreting stack traces!

Hi, my name is Brian.  I like sunsets, long walks on the beach, puppies, and de-compiling Java code.  If you have similar interests, please read on.  On the other hand, if you're not a Java programmer, this post might not be for you.

So of all the cool tools out there, JAD is, without a doubt, my absolute favorite...and it's not even close.  One of the beautiful things about Java is that the byte-code interpreted by the JVM can be reverse-engineered into source code.  JAD, or JavA Decompiler, is a tool for doing just as the name implies: decompiling byte code into source.  But so what?  Why would this possibly be useful?  Well, here's a list of just a few things that I've done by de-compiling code:

  • Found software bugs
  • Saved about 5 weeks of hassle with BEA support by pointing them to the line of source code that is causing a bug
  • Re-compiled bug fixes back into code as a band-aid solution to a problem (Note that this probably voids any/all support agreements that you've ever had with anyone)
  • Understood how/why a piece of software works like it does (I.e. "Why is Analytics deleting a bunch of historical information when I run a specific job...there's no documentation on it")
  • Found cool/un-documented software features (Again...no support, but playing around with stuff is fun)
  • This one is my favorite: Long ago, for "Bring your kids to work day", I was on the hook for talking to a bunch of 12 year olds about programming.  So I de-compiled a Ms. Pacman applet, found the code that does collision detection, modified it to not work, and re-compiled the new code back in.  The net result was a Ms. Pacman who could run through ghosts, and a bunch of kids learning how "If" statements work.
  • Also, this guy I know might have one time de-compiled a VI plugin for Eclipse to take out the license key check.  Don't worry though, I quickly told him how he was a bad boy, and he's been on the straight and narrow ever since.

OK, sorry for the rambling there, but just wanted to give you a flavor of the stuff you can do by de-compiling code.  More generally though, if you're a programmer at heart, it's always a good idea to look at source code, just to understand how things work.  Of course there are fringe benefits too:

  • Fellow portal geeks marvel over how "smart" you are because you "de-compiled" the portal and found some problem.
  • You'll get to complain about what idiots the people who wrote the code are
  • Men will want to be you
  • Women will want to be with you

So, let's look at a quick example, shall we?  JAD is a text-based utility, so get ready to fire up your DOS prompts and/or Shell windows.  I'm writing this post on a PC, so my examples will be through DOS, but UNIX folks, just make the normal DOS->*NIX changes, and you'll be fine.

Let's assume that I've been trying to get a piece of proprietary software to run for a while now, and every time I start it up, I get a stacktrace like the following:  stack_trace.png 

Now, you could argue that I should know better than to purchase a piece of software named, "Broken", but I'd argue right back that you're getting in the way of my example. 

Want to see JAD in action? Join me after the link for all the details!

Howdy all.  After a long hiatus I'm back to blog your socks off with some technical minutiae that will save a few of you lots of headaches, and help the rest of you get a good night's sleep.

 

Long Story Short (i.e. "Just the useful info, please") 

There's a bug in the JDK that causes problems when setting up certain types of Java server socket constructs on some Solaris 10 boxes.  This bug will likely manifest itself in one of the following ways in your ALUI environment:

 

1)  You're running the ptlogging utility and see your logs filled up with something like:

 

PTSelectorThread_17679958
com.plumtree.openkernel.impl.openhttp.core.network.PTSocketSelector     Unexpected exception.
java.io.IOException: Invalid argument
        at sun.nio.ch.DevPollArrayWrapper.poll0(Native Method)
        at sun.nio.ch.DevPollArrayWrapper.poll(Unknown Source)
        at sun.nio.ch.DevPollSelectorImpl.doSelect(Unknown Source)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
        at sun.nio.ch.SelectorImpl.select(Unknown Source)
        at sun.nio.ch.SelectorImpl.select(Unknown Source)
        at com.plumtree.openkernel.impl.openhttp.core.network.PTSocketSelector.run(PTSocketSelector.java:400)
        at java.lang.Thread.run(Unknown Source)

 

2) Portal starts up fine, but it can't connect to any remote servers, giving you the same stacktrace as above:

 

java.io.IOException: Invalid argument
        at sun.nio.ch.DevPollArrayWrapper.poll0(Native Method)
        at sun.nio.ch.DevPollArrayWrapper.poll(Unknown Source)
        at sun.nio.ch.DevPollSelectorImpl.doSelect(Unknown Source)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
        at sun.nio.ch.SelectorImpl.select(Unknown Source)
        at sun.nio.ch.SelectorImpl.select(Unknown Source)
        at com.plumtree.openkernel.impl.openhttp.core.network.PTSocketSelector.run(PTSocketSelector.java:400)
        at java.lang.Thread.run(Unknown Source)

 

So what's a good god-fearing person like yourself to do?  Well, if you act now, we'll send you TWO workarounds for the price of one:

 

1) Have an SA up the hard File Descriptor (FD) limit on the server to 8193 or greater by editing /etc/system to have the line:

 

set rlim_fd_max=8193

 

Note: You'll need to bounce the box after this change.  You can then verify it worked by running:

 

ulimit -n -H

 

Which should return a number >= 8193.  Sadly, this approach will probably have the SA asking you why you want to make the change, which means you'll have to read the technical details below....so on to option #2

 

2) Tell the appropriate JVMs to use a different Socket Selector configuration.  You do this by passing the following option to the JVM:

 

-Djava.nio.channels.spi.SelectorProvider=sun.nio.ch.PollSelectorProvider

 

Depending on what ALUI component you're updating, you may pass this option in different ways.  For instance, if you're dealing with one of the back-end servers (Collab, Studio, etc), you'll want to update wrapper.conf to add additional arguments like:

 

wrapper.java.additional.7=-Djava.nio.channels.spi.SelectorProvider=sun.nio.ch.PollSelectorProvider

 

Note: Replace "7" in the above line with the appropriate number for your wrapper file.

 

Or you may just need to update a shell script somewhere that's kicking off Tomcat/WebLogic/etc.  Note that these scripts all have their own shell variables for holding additional Java arguments, so just look through them and update as appropriate.  If you have problems, feel free to post questions here and we'll do our best to help out.

 

 

Long Story Long (i.e. "I'm kind of a geek, and I'm sitting at work with nothing else to do, so give me the details")

So ALUI uses the NIO Java packages that were introduced in Java 1.4.  FWIW, I always thought NIO stood for Non-blocking IO, but a little Googling reminds me that it's actually New IO...silly me.  In any case, the NIO packages let you do some cool things with sockets to more efficiently manage high-volume connections.  The underlying problem you're running into is that the out-of-the-box Selector implementation in the JDK uses /dev/poll to allocate 8192 File Descriptors (FD) for use by the selector, and 8192 exceeds the nofiles (Number of File Descriptors) limit on your server.  So, you can either bump up the server FD limit à la workaround #1 above, or tell Java to use a different selector implementation that doesn't allocate all those FDs à la workaround #2 above.  If you're interested in more detail, you can find the Sun bug on the issue here.
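
If you want to confirm which selector implementation a given JVM actually picked up (say, after adding the -D option), a quick check along these lines may help. It's only a sketch: the class name SelectorCheckSketch and its methods are made up, and I'm assuming that opening a selector and doing one non-blocking select is enough to trip the same code path as the stack trace above, which I haven't verified on an affected box.

```java
import java.io.IOException;
import java.nio.channels.Selector;
import java.nio.channels.spi.SelectorProvider;

// Hypothetical diagnostic class, not part of ALUI or the JDK.
class SelectorCheckSketch {

    /** Returns the class name of the selector provider this JVM is using. */
    static String activeProviderName() {
        return SelectorProvider.provider().getClass().getName();
    }

    /** Tries to open a selector and run one non-blocking select on it. */
    static boolean canOpenSelector() {
        try (Selector selector = Selector.open()) {
            selector.selectNow(); // returns immediately; no channels are registered
            return true;
        } catch (IOException e) {
            return false; // e.g. "Invalid argument" on an affected server
        }
    }

    public static void main(String[] args) {
        System.out.println("Selector provider: " + activeProviderName());
        System.out.println("Selector usable: " + canOpenSelector());
    }
}
```

Run it with the same JVM and the same arguments as the component you're fixing, otherwise you're testing a different configuration.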

 

Until next time...thanks for reading! 

AquaLogic: Alive and Kicking

We just got back from Oracle OpenWorld last week, and wow!  That conference is unbelievably huge, but very well-organized.  We met a lot of current and past friends and colleagues out there, but a big part of our attendance was to get more insight into the future of the AquaLogic portal stack.

Once again, Oracle reaffirmed 9 years' worth of support for the AquaLogic line, but their commitment to keeping AquaLogic as the primary code base wasn't as strongly expressed.  As you may know, Oracle has the WebCenter Suite, of which ALUI (WebCenter User Interaction) will play a part.  But Oracle's portal and BEA's portal will also remain in this space.  The current plan is to take most of the existing integration products and portals and fit them together in a more standards-based suite of producers (Oracle Services) and consumers (portal products), so you'll be able to leverage all of the cool new services that Oracle is working on with any of these portals.

They're also working on this thing called Oracle WebCenter Spaces, which is a fantastic-looking UI for creating "portals" with JDeveloper and Oracle's Application Development Framework.  This thing was slick - we went to a lab where we created a rich mock-customer UI which integrated elements of Oracle's Content Management system (which'll replace Publisher) using JDeveloper and just dragging-and-dropping widgets onto a page and wiring the objects together with property pages.  The thing felt more like BEA's portal (which was more developer-focused) than AquaLogic's (more configuration than development), but it was still pretty impressive. 

We were told all of those rich components will be surfaced through the AquaLogic portal as well via WSRP, so in theory, ALUI customers aren't left out in the cold as these rich components are developed.  As Oracle fine-tunes the ultimate portal strategy, they're also developing these components as "WebCenter Services", which are basically standards-compliant portlets that will be available to all portals.  So we should be getting the best of all worlds - existing portals plus all the "generic" web services (like discussion groups/collab, and the new Enterprise 2.0 functionality). 

The big outstanding question, though, is whether WebCenter Spaces will become the de facto "portal" that gets the most development effort from the Oracle Team going forward.  Eventually these products will have to start consolidating, and time will tell whether that's nine years from now where each product has gradually started looking like the next anyway, or 2 years from now where more dramatic migration strategies will need to be put in place.  My take?  Either way, it's no big deal.  We Plumtree folks have had some pretty major upgrade/migrations to go through before, and we were always taken care of.  Remember how hard the 4.5WS -> 5.0 migration was?  Well, we haven't had a major game-changing upgrade since then (the 5 -> 6 upgrade was pretty clean, but was more evolutionary than revolutionary).  If Oracle does go with a big migration strategy from the ALUI code base to something else, I'd bet that the migration won't be as complicated as the 5.0 migration was, and the rich new set of functionality available will make it worthwhile.

The important thing here is:  don't panic!  ALUI will be around and recognizable for a LONG time, and judging by what we saw at Oracle OpenWorld, there are lots of exciting new technologies just around the corner that will make it even more valuable.

Remotely Reboot Windows Servers

You may never need this one, but when and if you do, you'll be glad you've got this tip in your back pocket.  No doubt, you never actually sit in front of a console for your AquaLogic Servers; if you're on Windows, you're using Windows Remote Desktop.

Occasionally, you need to reboot those servers, and you do it through the remote desktop.  But once in a blue moon (or more - it's happened to me on a half dozen machines at one client and 2 machines at another in the last month), you go to reboot and Remote Desktop never comes back.  Typically, this is because Windows has started shutting down and killing Windows processes - including Terminal Services.  But for whatever reason, Windows doesn't reboot, and Remote Desktop is no longer available because the process is gone and isn't restarted.

The solution:  remotely force the box to reboot again.  Simply run:

shutdown /r /f /m \\servername

... from any of the machines in the subnet. You're telling the OS to force (/f) a reboot (/r) on a remote machine (/m). Works every time the OS is available but the Terminal Service is dead - check out the MS article for more info on this command.

Beware the Security Propagation Bug(s)

We've warned you before about ACL propagation when you're changing the security in ALI.  Heck, we even created a product to ease the pain of this important task.  Today's bug is about another issue with security propagation.

Well, it's actually 2 bugs (maybe 3).  Let me explain:

When you answer "yes" to that question about security propagation, a job is created.  Here's the problem:  the job is run as the user who created the folder, not the user changing the security.  What if you later delete that user?  Well, bug #1: automation server is hosed.  You're going to get an error like this:

failed-job.jpg

The Exception says "*** Job Operation #1 of 1 with ClassID 20 and ObjectID 898 cannot be run, probably because the operation has been deleted.", and the error's wrong (because it says "probably", I won't count this as a bug).  The real problem in this case is that the folder's OWNER has been deleted, not the operation itself.

This gets me to Bug #2: when you delete a user, they're removed from all groups, but apparently they're not removed as OWNERS of any of the admin folder objects (and how could they be?  What should the owner revert to?).  Obviously, this is what causes the problem with automation server.  If this one were fixed somehow, bug #1 would become irrelevant.

Bug #3 (this one I haven't confirmed yet, but is likely to exist given the testing and logs mentioned above): suppose Joe Content Manager creates an entire structure of folders.  Then Joe's boss decides Joe shouldn't be an admin for one of the child folders (say, an "Executive Committee" folder) and removes Joe's access.  The boss changes the security and propagates it on that "Executive Committee" folder.  Later, the boss wants to give Mary Executive privileges to the folder.  He changes the ACL again and chooses to propagate.  Here's the rub:  because the job is run as Joe, and Joe no longer has access, the job fails, and Mary cannot be added.  In fact, it's likely that Joe can't even be added back, so that ACL is completely frozen unless the admin goes through every object and re-adds Joe (or changes the owner of the folders through the DB).  Have I mentioned LockDown?

Anyway, want the SQL Scripts to fix the owners on the various folders?  Hit the jump.

This is a little old-school, but still a very relevant tip.  This problem has been around for all of the ALI 6.x days (including 6.5), and if I remember correctly, even the Plumtree 5.x days.  Basically, the automation server logs its activity to the PTJOBLOGS table, and if you've got jobs that are logging verbosely or running very frequently, this table can become HUGE:  I had a recent customer whose database had grown to over 80 GB because of this table.

Basically, to prevent this table from growing astronomically big, make sure:

  1. You don't have jobs that are running way too often (once a minute typically means a job is running - and logging - constantly)
  2. You don't have verbose logging for any jobs running in the portal, and
  3. That your portal configuration is set to only save a reasonable amount of job log data.  By default, the portal will keep around 60 days' worth of logs, which is a pretty big number.  If you ask me, any job logs older than 7 days are worthless because jobs always run more than once a week.  But, of course, you're not asking me, so I'll just tell you how to change this setting and you can decide for yourself.  The configuration setting isn't available through the UI; instead you have to tweak the database.  Specifically, you have to update the row in the PTSERVERCONFIG table in the ALI database where SETTINGID=15.  Set the value to whatever you think is appropriate based on your use of the job log:

ptjoblogs.jpg

Finally, what happens if you've already got a PTJOBLOGS table that contains 200 million rows?  Here's a tip that I got from our friend and client, Mike Jones at MedSolutions, Inc.: if you run an amateur delete like "delete from ptjoblogs", the delete will take forever because every deleted row is written to the transaction log.  If you run "truncate table ptjoblogs", on the other hand, the database just deallocates the table's data without logging each row, so you don't have to wait ridiculous amounts of time for it to finish (not to mention the amount of database log space you'd otherwise be consuming).

Here's a little feature that some of you may find useful:  Collaboration Server 4.5 can send you an email if it can't connect to the new Notification Service.  For those of you that have had countless problems with the old Notification Server that shipped with Collab 4.2 and earlier, this is a must-have feature.

Just go to Administration: Select Utility...: Collaboration Administration: Collaboration Notification, and enable Health Monitoring:

 

collab_notify2.jpg 

It works, too (assuming your SMTP server doesn't require authentication for internal addresses).  Collab is even kind enough to mark the mail as "High importance":

 

collab_notify_email.jpg

On the other hand, for those of you that don't appreciate this new little nugget of functionality, consider this irony:  Collaboration 4.5 uses email to tell you when email is broken.  Cosmic, man.