Organizing Your Splunk Shoe Rack (Defining Index Structures, Part 1 of 2)


Your Splunk Shoe Rack

When splunking with a new customer, one of the first things I review when auditing their environment is their index structure. Why? Well, there's a lot you can tell about the maturity of a Splunk deployment based on this particular configuration. The old saying that Forrest learned from his mom comes to mind...

"Momma always says there's an awful lot you could tell about a person by their shoes. Where they're going. Where they've been. I've worn lots of shoes. I bet if I think about it real hard I could remember my first pair of shoes." - from the movie Forrest Gump

In many burgeoning installations, all data is sent to the "main" (default) index.  This approach is like using one pair of shoes for everything you do, whether that's going to the office, hiking, dancing up a storm, or playing tennis.  That may be okay when you're 5, but not at 35... well, not at 10 really, for that matter.  The all-in-one shoe doesn't have the right attributes for each task.  Having a pair of loafers, tennis shoes, and boots to choose from, on the other hand, lets you pick the right shoe for a given task.  Now imagine all of those shoes neatly organized on a shoe rack.  This is how you should think about your Splunk index structure.

Okay, so let's take a step back. Why does index structure matter?  Isn't it okay to just dump all your organization's data into the "main" index and never have to type something like "index=tennis_shoes"? Of course not! As your installation grows, a few things will inevitably happen:

  1. different data types will be ingested
  2. different groups of people will need access to different sets of data
  3. different types of data will have different retention requirements
  4. search performance will become a bigger concern

Here's how a well-thought-out index structure addresses these items.

Security and Data Access Control

In many organizations, access to data is granted on a "need to know" basis. There might be sensitive information in the data sources that only certain members of the organization should be able to see. If all the data is in one index, controlling access becomes complicated. It's possible to set search filters per role, but this is less optimal: the filter configuration has to be maintained, and there's a performance hit because data must be retrieved and then filtered. By segregating data, you can take advantage of the ability to set which indexes a Splunk user role has access to. An example of this strategy would be creating an index to receive firewall data and only allowing the "operations" group access to that index.
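To make this concrete, here's a minimal sketch of what that restriction could look like in authorize.conf. The role name "operations" matches the example above, and the "firewall" index is assumed to already exist:

  # authorize.conf -- a minimal sketch; the role and index names are illustrative
  [role_operations]
  # Limit this role to searching only the firewall index
  srchIndexesAllowed = firewall
  # Search the firewall index by default when no index is specified
  srchIndexesDefault = firewall

With something like this in place, a user holding only the "operations" role can't pull back events from other indexes, no matter what search they run.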

Data Retention

Organizations have different requirements for how long they want to keep data. The retention policy is usually dictated by three factors, listed below in order from highest to lowest precedence:

  1. There is a legal requirement to store the data for a certain amount of time, such as industry or federal compliance standards. Think PCI or FISMA.
  2. There is an organizational need to keep the data. For example, a development group might want to keep application log data for their entire three month development cycle to monitor performance during that span.
  3. Although storage capacity is growing tremendously, it's still finite and has a real cost. By retaining data for only as long as it's required, an organization is able to best utilize its data storage resources. (Why keep around old, worn-out pairs of shoes you never wear?)

Let's take firewall data as an example. An organization may have a legal requirement to keep the data for one week. The IT operations group finds it useful to be able to search the data for two weeks. However, because of the volume of data generated by the firewalls, the current storage configuration only allows data to be kept for ten days. In this case, the "firewall" index would be configured to purge data after it's ten days old.
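As a rough sketch, that ten-day policy could be expressed in indexes.conf along these lines (the paths shown are just the conventional defaults; adjust for your own storage layout):

  # indexes.conf -- a minimal sketch; paths follow common convention
  [firewall]
  homePath   = $SPLUNK_DB/firewall/db
  coldPath   = $SPLUNK_DB/firewall/colddb
  thawedPath = $SPLUNK_DB/firewall/thaweddb
  # Roll buckets to frozen (deleted by default) once events are older
  # than 10 days: 10 days x 86,400 seconds/day = 864,000 seconds
  frozenTimePeriodInSecs = 864000

Note that frozenTimePeriodInSecs is age-based. If storage capacity is the real constraint, size-based settings can cap the index as well, but the age-based setting matches the ten-day requirement in this example.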

Search Performance

In general, you want to think about the types of searches that will be run on the data after it's indexed.  Let's say an "application" index contains data from applications A, B, and C, but data from app C only accounts for 10% of the volume and is usually searched on its own, not in combination with data from apps A and B. In this case, it would be better to create a separate index for app C's data to improve search performance, since searches for app C no longer pull events from apps A and B off the indexers only to discard them. However, if most searches span data from apps A, B, and C, such as when defining a transaction across multiple systems, then keeping them in the same index makes sense, since most of the data pulled from the indexers would be relevant to the search.
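As a hypothetical sketch of the first approach, routing app C's events to their own index is just a matter of setting the index on the input. The monitor path, sourcetype, and index name below are all made up for illustration:

  # inputs.conf -- a minimal sketch; path, sourcetype, and index name are illustrative
  [monitor:///var/log/app_c]
  sourcetype = app_c
  # Send app C's events to a dedicated index instead of the shared "application" index
  index = app_c

Searches over app C's data then become "index=app_c ..." and only touch that index's buckets, while apps A and B continue to share the "application" index.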

Hopefully this post provided some insight into how to plan your index structure.  In Part 2 of this series, I'll walk through a use case and show how to implement the above strategy through Splunk configuration.
