Monitoring Frozen Data Storage in Splunk

[Image: Frozen Wasteland]

In this post, I'd like to visit the "Siberia" of Splunk data: frozen (archived) storage.  For every other type of data, you can get insight into your Splunk data at the index and bucket level by using the "dbinspect" command or apps like "Fire Brigade."  However, because frozen data "lives" outside the world of Splunk, there is no built-in way to get insight into it.  Therefore, I will outline a solution: a scripted input that sends frozen-storage metrics to Splunk, where they can then be used for reporting.
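As a quick illustration of what is available for non-frozen data (using a hypothetical "windows" index), a search like the following summarizes bucket sizes by state; frozen buckets will never appear in these results:

| dbinspect index=windows | stats sum(sizeOnDiskMB) AS size_mb by state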

Create the Script

In our sample environment, frozen data is stored on each indexer under the "/data/frozen/" path.  Inside the "frozen" directory are directories for each index, which in turn contain the frozen buckets.  For example, archived data for the "windows" index would live in the "/data/frozen/windows/" directory, which would contain many frozen buckets.

One of the metrics we wish to obtain is how much space the frozen data takes up, per index and in total.  Below is a bash script to collect this data.  I made the comments fairly verbose to help illustrate what is going on.

---frozen_storage_metrics.sh---

#!/bin/bash

#Do not output STDERR messages
exec 2>/dev/null

#Set the value for the frozen path
FROZEN_PATH="/data/frozen"

#Capture the current timestamp for use when outputting events
CURR_DATE="$(date +%Y/%m/%dT%T)"

#Iterate through each index directory in the frozen path.  The "_dir" variable holds the path for the current iteration
for _dir in "$FROZEN_PATH"/*/
do
        #Extract the index name (the last component of the path) and store it for later use
        CURR_IDX=$(basename "$_dir")

        #Use the "du" command to get the size of the directory.  Only the "total" line is used and is then transformed to include a field name "frozen_size_mb"
        FROZEN_SIZE_MB=$(du -cms "$_dir" | grep 'total' | perl -pe 's/(\d+)\stotal/frozen_size_mb=\1/')

        #Output the data into a Splunk-friendly event format, which includes:
        # 1. A timestamp at the beginning of the event
        # 2. A delimited set of key-value pairs

        echo "$CURR_DATE",index_name="$CURR_IDX","$FROZEN_SIZE_MB"
done

#Get the total size of the frozen path.  This is optional since the events produced above can be aggregated to produce a total amount
FROZEN_TOTAL_SIZE=$(du -cms "$FROZEN_PATH"/ | grep 'total' | perl -pe 's/(\d+)\stotal/frozen_size_mb=\1/')

#Output the data.  Set "index_name" to "all" since this is for the entire path.
echo "$CURR_DATE",index_name=all,"$FROZEN_TOTAL_SIZE"

---

One thing to note: the last "du" command, which gets the total size of the frozen path, could also have been used to get the sizes of all the individual indexes in a single call.  That would have produced one event in a tabular format, which could then be parsed into separate events at search time.  The script above is instead formulated to produce a single event per index, which avoids any search-time manipulation of the data.
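For reference, here is a minimal sketch of that tabular alternative; with the "-c" flag, one "du" call reports a "<size_mb>	<path>" line per index directory plus a grand-total line:

#One "du" call covering every index directory at once (tabular output, parsed at search time)
du -cms "$FROZEN_PATH"/*/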

Also, the directory name for the index is being used as the "index_name."  We know, however, that this directory could have any name, depending on how the index paths are configured in indexes.conf.  In practice, most index configurations use the name of the index as the directory name on the filesystem.  One example where this is not the case is the default "main" index, which appears as "defaultdb" on the filesystem.
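For illustration (the stanza below is hypothetical, not taken from our environment), the frozen location is typically set per index in indexes.conf with the "coldToFrozenDir" attribute, and the directory name is simply whatever that path ends in:

---indexes.conf (illustrative)---
[windows]
#Buckets frozen from this index are moved here instead of being deleted
coldToFrozenDir = /data/frozen/windows
---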

You can test the script by executing it at the command line.  When doing so, you should see output similar to the following:

2016/04/01T10:38:09,index_name=windows,frozen_size_mb=95
2016/04/01T10:38:09,index_name=os,frozen_size_mb=67
2016/04/01T10:38:09,index_name=defaultdb,frozen_size_mb=4
2016/04/01T10:38:09,index_name=network,frozen_size_mb=14
2016/04/01T10:38:09,index_name=mcafee,frozen_size_mb=1
2016/04/01T10:38:09,index_name=all,frozen_size_mb=181

This format is easy for Splunk to parse and minimizes the amount of both index-time and search-time configuration necessary to use the data.

Create the Scripted Input

Now that we have our script, we'll create an add-on to package it along with the inputs configuration.  We'll call the app "acme_TA_indexer_metrics."  The script is stored in the "bin" directory inside the app, and the inputs.conf configuration is stored in the "default" folder.  The inputs.conf configuration would appear as follows:

---default/inputs.conf---
[script://./bin/frozen_storage_metrics.sh]
index=os
disabled=true
sourcetype=indexer_storage_metrics
interval=3600
---

The scripted input is configured to send data to the "os" index, set the sourcetype to "indexer_storage_metrics," and run every hour (the "interval" attribute is set to 3600 seconds, or one hour).  The input ships disabled so that it only runs where it is explicitly enabled.

To activate the input, create a copy of the stanza in an inputs.conf in the "local" folder:

---local/inputs.conf---
[script://./bin/frozen_storage_metrics.sh]
disabled=false
---

Not shown here is the index-time configuration to explicitly extract the timestamp and set line breaking; it would be included in the add-on's "default/props.conf" file.
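As a rough sketch, that props.conf might look something like the following; the values are assumptions based on the event format shown earlier:

---default/props.conf (sketch)---
[indexer_storage_metrics]
#Each event is a single line
SHOULD_LINEMERGE = false
#The timestamp leads the event, e.g. 2016/04/01T10:38:09 (19 characters)
TIME_FORMAT = %Y/%m/%dT%H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 19
---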

In our scenario, this add-on is deployed, with its input enabled, to the indexers that have access to the frozen data.  Its main purpose there is to execute the script and collect the data.  If additional search-time configuration were added to this add-on, it would also be deployed to the search heads.  Note that Splunk must be installed on a server that has access to the frozen data.  Another consideration: if the data is not stored on the indexers themselves, a portion of the file path could be extracted and used as a "hostname" field to identify which indexer the frozen data originated from.  For example, the data might be organized as "/data/frozen/indexer1," "/data/frozen/indexer2," and so on; the "indexer1" portion of the path would then identify the indexer.
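Here is a minimal sketch of that variation, assuming the hypothetical "/data/frozen/<indexer>/<index>/" layout described above (it reuses the FROZEN_PATH and CURR_DATE variables from the original script):

---frozen_storage_metrics.sh (per-indexer variation, sketch)---
#Outer loop walks the per-indexer directories, e.g. "/data/frozen/indexer1/"
for _host_dir in "$FROZEN_PATH"/*/
do
        #Extract the indexer name, e.g. "indexer1"
        CURR_HOST=$(basename "$_host_dir")

        #Inner loop walks the index directories for that indexer
        for _dir in "$_host_dir"*/
        do
                CURR_IDX=$(basename "$_dir")
                FROZEN_SIZE_MB=$(du -cms "$_dir" | grep 'total' | perl -pe 's/(\d+)\stotal/frozen_size_mb=\1/')
                echo "$CURR_DATE",hostname="$CURR_HOST",index_name="$CURR_IDX","$FROZEN_SIZE_MB"
        done
done
---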

Now that this data is collected in Splunk, you can use it to monitor the size of archived (frozen) directories.  An example search:

index=os sourcetype=indexer_storage_metrics index_name="all" | eval frozen_size_gb = frozen_size_mb / 1024 | timechart max(frozen_size_gb) by host

The above search would give the total size of each indexer's frozen storage over time.  
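Similarly, the per-index events can be charted to spot which indexes are consuming the most frozen storage, for example:

index=os sourcetype=indexer_storage_metrics index_name!="all" | eval frozen_size_gb = frozen_size_mb / 1024 | timechart max(frozen_size_gb) by index_name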

Conclusion

Because frozen data is "orphaned" from Splunk, you need alternative ways to gather metrics on it.  This post covered how to collect storage size for archived data, but the same approach could be used to collect other metrics, such as the number of frozen buckets per index.
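As a rough sketch, a bucket count could be collected by adding a couple of lines inside the script's per-index loop; this assumes the frozen buckets keep Splunk's standard "db_*" directory naming, which may vary in your environment:

#Count the frozen bucket directories in the current index (assumes "db_*" naming)
BUCKET_COUNT=$(find "$_dir" -maxdepth 1 -type d -name 'db_*' | wc -l)
echo "$CURR_DATE",index_name="$CURR_IDX",bucket_count="$BUCKET_COUNT"

Thanks for reading and happy Splunking!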

 

 
