Lessons Learned: Upgrading a Splunk Instance with No Downtime
Upgrading a single machine's instance of Splunk is easy. All we need to do is stop the instance, download either the .tar or the .rpm, and then either untar or yum install the package, restart Splunk, and voilà, we have an upgrade!
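As a rough sketch of that stop, upgrade, restart cycle for a tarball install: the `/opt/splunk` location, the package path, and the `run`/`DRY_RUN` helper are assumptions made for illustration, not part of any Splunk tooling. By default the function only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: upgrade one Splunk instance from a tarball.
# SPLUNK_HOME is an assumed install location; adjust it to your environment.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

upgrade_from_tarball() {
  local package="$1"
  local parent
  parent="$(dirname "$SPLUNK_HOME")"
  run "$SPLUNK_HOME/bin/splunk" stop
  # The tarball unpacks into a top-level splunk/ directory, so extract it
  # into the parent of SPLUNK_HOME to overwrite the existing install.
  run tar -xzf "$package" -C "$parent"
  run "$SPLUNK_HOME/bin/splunk" start --accept-license --answer-yes
}
```

Once the printed commands look right, something like `DRY_RUN=0 upgrade_from_tarball /tmp/splunk-package.tgz` (a placeholder file name) would run them for real.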
Updating a Splunk deployment of 30+ Splunk instances all at the same time is also easy, if you're allowed to stop all the instances at once. All we need to do is the same thing as above, either 30 times... or script it to be done 30+ times.
However, what do we do when we need to upgrade a customer who requires 24 hour uptime, but has a half-dozen heavy forwarders, a robust indexer cluster, two or more search head clusters, and management servers for each (deployment server, cluster master, and multiple deployers)?
There's a procedure for that, and it needs to be followed exactly. It's not very intuitive, though: certain tiers need their management instance upgraded first (cluster master), others need their management instance upgraded last (deployer), and others don't care about their management server, but need to be upgraded in a certain order relative to the other tiers (forwarders/deployment server).
So, with a full Splunk deployment that includes heavy forwarders, an indexer cluster, and one or more search head clusters, this is the order the tiers need to be upgraded in:
1. Cluster Master*
2. Search Heads*
3. Deployers
4. Indexers (Search Peers)*
5. Heavy Forwarders
6. Deployment Server
7. Universal Forwarders (optional)
What, you may ask, are the steps above that were marked with asterisks? Glad you asked! Those are the steps that don't follow the "easy" stop, upgrade, restart process we talked about in the first line above. In addition to the quirks below, you'll want to upgrade at least steps 1-4 as quickly as possible, so that version mismatches between tiers don't cause errors or warnings. Also, before performing ANY upgrade, always back up the configurations on all Splunk instances and the indexed data on the indexers.
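For the configuration half of that backup, a minimal sketch might archive `$SPLUNK_HOME/etc` before touching anything. The archive path, the `/opt/splunk` default, and the `run`/`DRY_RUN` helper are assumptions for illustration; backing up the indexed data itself depends on your storage and is not covered here.

```shell
#!/usr/bin/env bash
# Sketch: archive an instance's configuration directory before upgrading.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"   # assumed install location
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

backup_splunk_etc() {
  local archive="$1"
  # -C changes into SPLUNK_HOME first, so the archive holds a relative etc/ tree.
  run tar -czf "$archive" -C "$SPLUNK_HOME" etc
}
```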
If the environment has a standalone License Master, you're in luck! You can upgrade it at any time!
Cluster Master
So, I lied a little bit up there: upgrading the cluster master itself is pretty straightforward, but right after you restart it, you're going to need to put it into maintenance mode. Why would we do that? It's so the cluster master doesn't try to restore our search and replication factors while we take members of the cluster down to upgrade them. Forgetting this step will slow things considerably in step 4 (the indexers), which, even when done "fast," is by far the slowest step of all that were listed above.
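On the cluster master, entering maintenance mode could look roughly like the sketch below. `splunk enable maintenance-mode` and `splunk show maintenance-mode` are standard Splunk CLI commands; the `/opt/splunk` path and the `run`/`DRY_RUN` helper are assumptions for illustration. The CLI may ask for confirmation and credentials when run for real.

```shell
#!/usr/bin/env bash
# Sketch: put the cluster master into maintenance mode after its upgrade.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"   # assumed install location
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

enter_maintenance_mode() {
  # Stops the master from starting bucket fix-up while peers go down.
  run "$SPLUNK_HOME/bin/splunk" enable maintenance-mode
  # Confirm the mode actually changed before touching any indexer.
  run "$SPLUNK_HOME/bin/splunk" show maintenance-mode
}
```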
Search Heads
Alright, now we can head to the search head cluster, where you'll need to pick a lucky search head. I tend to pick something that's numbered 3 or above, because, well, those kids, like me, were rarely picked for dodgeball. Then, go into that search head, stop it, follow the upgrade process, restart it, and then make it the search head cluster captain! (Nerd's revenge!) Once that one upgraded member is captain, all the other search heads can be upgraded using the super easy process we talked about above. Then, once all the search heads in a cluster are upgraded, the deployer can be upgraded, too. None of this is tough when the process is followed, but any deviation from it can break the search head cluster, forcing you to wipe every member and rebuild the cluster from scratch, so... let's not do that.
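Handing captaincy to the freshly upgraded member can be sketched as below, using the standard `splunk transfer shcluster-captain` and `splunk show shcluster-status` CLI commands run from a cluster member. The management URI in the usage note is a placeholder, and the `/opt/splunk` path and `run`/`DRY_RUN` helper are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: make the upgraded search head the cluster captain.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"   # assumed install location
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

make_captain() {
  local new_captain_uri="$1"   # management URI of the upgraded member
  run "$SPLUNK_HOME/bin/splunk" transfer shcluster-captain -mgmt_uri "$new_captain_uri"
  # Check who the captain is before upgrading the remaining members.
  run "$SPLUNK_HOME/bin/splunk" show shcluster-status
}
```

For example, `make_captain https://sh3.example.com:8089` prints the two commands for a hypothetical third search head.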
Indexers (a.k.a. Search Peers)
Well, here's the step that takes the longest, but only because Splunk is working hard to make sure not to lose any of your data. Rude, right? Instead of our easy way, we're going to work through the indexers one by one: offline an indexer, upgrade it, restart it, then move on to the next. Sounds simple, right? Well, depending on your environment, offlining an indexer can take 30-60 minutes, especially if your upgrade window coincides with the nightly "let's run all the reports while users are asleep and system load is much lower" window. The toughest part about this step is resisting the temptation to offline as many indexers at once as your search (or replication) factor allows. While you CAN do that, the time it takes scales in a way that we don't like: offlining one indexer took 30 or so minutes in a 500GB/day environment, but offlining two indexers at the same time in that environment took two and a half hours. Don't be like me - learn from my experience and offline only one indexer at a time!
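The one-at-a-time loop might be sketched like this. It assumes SSH access to each indexer as a user allowed to run Splunk, a package already copied to the same path on every host, and that `splunk offline` does not return until the peer has shut down (worth verifying on your version). The host names, paths, and the `run`/`DRY_RUN` helper are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: offline, upgrade, and restart indexers strictly one at a time.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"   # assumed install location on every indexer
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

upgrade_indexers_serially() {
  local package="$1"
  shift
  local parent host
  parent="$(dirname "$SPLUNK_HOME")"
  for host in "$@"; do
    # Take this peer offline gracefully; the next host waits until the
    # whole cycle for this one has finished.
    run ssh "$host" "$SPLUNK_HOME/bin/splunk offline"
    run ssh "$host" "tar -xzf $package -C $parent"
    run ssh "$host" "$SPLUNK_HOME/bin/splunk start --accept-license --answer-yes"
  done
}
```

The loop deliberately has no parallelism: each indexer finishes its restart command before the next one is taken offline.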
Don't forget to take your cluster master out of maintenance mode once the entire indexer tier has been upgraded... the longer you wait, the more time it will take for the cluster to get back to a valid state.
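Leaving maintenance mode, run on the cluster master, could be sketched as follows. `splunk disable maintenance-mode` and `splunk show cluster-status` are standard Splunk CLI commands; the `/opt/splunk` path and the `run`/`DRY_RUN` helper are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: take the cluster master out of maintenance mode after the indexer tier is done.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"   # assumed install location
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

leave_maintenance_mode() {
  run "$SPLUNK_HOME/bin/splunk" disable maintenance-mode
  # Watch the search and replication factors until the cluster reports them as met.
  run "$SPLUNK_HOME/bin/splunk" show cluster-status
}
```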
Note: The procedure above deviates from the official Splunk Docs version of upgrading, which lists these steps: 1) enable maintenance-mode on the cluster master, 2) stop (not offline) all the search peers, 3) upgrade all the search peers, 4) restart all the search peers, and 5) disable maintenance-mode on the cluster master. While this will upgrade the system faster than the process above, it introduces downtime and will interrupt any reports or saved searches that should have run while the upgrade was being performed. If your customer and their Splunk environment cannot tolerate ANY downtime, the official process will not work for you, and you need to use the process I described above!
Last Words of Wisdom:
Remember, everything above was written from hard-won experience. Now, when you need to upgrade a complex Splunk environment, you can avoid the many pitfalls that result in issues like increased overall upgrade time, the loss of a search head cluster, lost reports, environment downtime, and other horrors!