The Seven Dwarfs of Data On-boarding in Splunk

By: Rupak July 26, 2012

In my time working with and using Splunk, I have learned a few tricks and tips to make the Splunk experience even better. This post assumes you are familiar with a few Splunk keywords. If you are having trouble following along, take a look at this link and look up the terms: http://docs.splunk.com/Splexicon. If you have never seen Splunk before, I suggest taking a look at the Splunk Tutorial to familiarize yourself with the product: http://docs.splunk.com/Documentation/Splunk/latest/User/WelcometotheSplunktutorial

Splunk ships with a multitude of configuration files and a slew of attributes that it uses to make sense of your data. However, you can take some of that heavy work off Splunk as it interprets your data. Instead of having Splunk run through its default configuration files and attributes, why not identify exactly what you want Splunk to do with your data? There are attributes you can set in the props.conf file that will help Splunk index your data faster and exactly the way you want.

When bringing on new data, you should always index it into a test environment rather than adding it directly to a production indexer. Since there is no way to remove data once it's been indexed (that is, without cleaning out the entire index), it is always better to run a few tests first. If you do not have a test/development environment, you can always download Splunk on your personal computer and run your tests there. If you cannot get a sample set of your data, you can create a test index on one of your production indexers. The bottom line here is that you should never index data on your production indexers without testing it first.
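
As a rough sketch of the test-index approach, a dedicated index could be defined in indexes.conf along these lines. The index name onboarding_test is a hypothetical placeholder, and the paths assume the usual $SPLUNK_DB layout, so adjust both to your environment.

```ini
# indexes.conf: hypothetical scratch index used only for on-boarding tests
[onboarding_test]
homePath   = $SPLUNK_DB/onboarding_test/db
coldPath   = $SPLUNK_DB/onboarding_test/colddb
thawedPath = $SPLUNK_DB/onboarding_test/thaweddb
```

Because the test data lives in its own index, cleaning that index out after a failed attempt does not touch anything else on the indexer.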

The attributes in props.conf that I mentioned earlier can be easily forgotten, so here is a little mnemonic device to help you. There happen to be seven attributes you want to set in props.conf every time you bring data into Splunk. You can relate them to the Seven Dwarfs from Disney's story of Snow White.

Dopey → TIME_PREFIX. This is the first attribute and Dopey is usually the first dwarf that everyone remembers. This attribute tells Splunk where to start looking for the timestamp in your event. Its value is a regular expression that matches the text immediately before the timestamp.
Happy → MAX_TIMESTAMP_LOOKAHEAD. Setting this attribute makes Splunk happy and run more efficiently because it will not have to spend any extra time and resources to find the timestamp. You can tell Splunk to look no more than, say, 20 characters into your event (or past the TIME_PREFIX match, if you set one) so that it does not waste time scanning the entire event.
Sleepy → TIME_FORMAT. Many people "sleep" on this attribute and shouldn't. It is very important for helping Splunk interpret your data. With this attribute you tell Splunk the format of your timestamp using strptime syntax (http://www.tutorialspoint.com/python/time_strptime.htm). Splunk will not have to guess whether 10/2/12 is October 2, 2012, February 10, 2012, or even something weird like December 2, 2010.
Doc → SHOULD_LINEMERGE. Like Doc, this attribute is the leader and decision maker of the group. Depending on the value of this attribute, other attributes are required. This setting should be set to "false" and used along with the LINE_BREAKER attribute, which can greatly increase processing speed.
Grumpy → LINE_BREAKER. This attribute looks the toughest and most intimidating. It identifies how your events should be broken apart. This is important because if it is not set correctly, a single event's data can end up spread across multiple events. The value is a regular expression: Splunk looks for that pattern and breaks up your events accordingly. As mentioned before, it should be used in conjunction with SHOULD_LINEMERGE = false.
Sneezy → TRUNCATE. This attribute is nothing to sneeze at. Measured in bytes, it limits the length of an event (line of data); anything past the limit is cut off. The default value is 10000, so if your events can be longer than that, raise it to 999999 bytes or more depending on the size of your events. Because your data is not being merged into multiline events (given SHOULD_LINEMERGE = false), each event is handled as a single line, and you will want to make sure long events are not chopped off incorrectly.
Bashful → TZ. Like Bashful, the time zone attribute is probably the most forgotten of the group. You will want to set the time zone for each host in your Splunk environment. This will ensure the time is displayed correctly on your search head.
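
To see how the seven fit together, here is a minimal props.conf sketch. Everything in it is an assumption made for illustration: the sourcetype name my_app_log, the sample event shown in the comment, and the Eastern time zone. Treat it as a starting point to test against your own data, not as a drop-in configuration.

```ini
# props.conf: hypothetical sourcetype; all values are illustrative assumptions
# Sample event this stanza is written for:
#   [2012-07-26 14:03:55] INFO user=alice action=login
[my_app_log]
# Dopey: the timestamp starts right after the opening bracket
TIME_PREFIX = ^\[
# Happy: "2012-07-26 14:03:55" is 19 characters, so look no further
MAX_TIMESTAMP_LOOKAHEAD = 19
# Sleepy: year-month-day, 24-hour clock
TIME_FORMAT = %Y-%m-%d %H:%M:%S
# Doc: do not merge lines; rely on LINE_BREAKER alone
SHOULD_LINEMERGE = false
# Grumpy: break only at newlines followed by a bracketed date;
# the newlines in the first capture group are discarded
LINE_BREAKER = ([\r\n]+)\[\d{4}-\d{2}-\d{2}\s
# Sneezy: allow long events instead of cutting them at 10000 bytes
TRUNCATE = 999999
# Bashful: the sample timestamp carries no offset, so state the zone
TZ = America/New_York
```

Note that the comments sit on their own lines, since a comment placed after a value on the same line would be read as part of that value.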

For more information about each of these props.conf attributes, take a look at the following link: http://docs.splunk.com/Documentation/Splunk/latest/Admin/Propsconf

I hope you found this helpful. Happy Splunking!!

Image courtesy of http://www.tripletsandus.com/disney/dwarfsmisc.htm

Tags: Analytics, Big data, Data Inputs, Indexing, Machine data, Operational Intelligence, props.conf, Splunk
