Think small, search faster

By: Ridwan August 19, 2013

Compared to a few years ago, the amount of data we can sift through, and the speed at which we can do it, is almost unbelievable. But just as with today's cars, which are much faster than similar cars of yesteryear, we get used to the speed we have and soon wonder: “Can it go faster?”

Yes, but with some conditions.

Just as turbocharging, supercharging, and enlarging the displacement are all valid ways to make a fast car go faster, Splunk offers Report Acceleration, Summary Indexing, and searching tsidx files. Each has its own benefits and drawbacks, so it’s important to understand which approach is appropriate when. The basic idea behind all of these methods is the same: limit the set of data the search has to run against. Beyond that, it is a matter of resources (space on the search head, space on the indexers) and of how current the results need to be.

Summary Indexing

Summary Indexing involves creating small indexes by using specific search commands. “Summary indexing is a method you can use to speed up long-running searches that use commands that are not streamable before the reporting command. It's similar to report acceleration in that it involves populating a data summary with the results of a search, but in this case the data summary is actually a special summary index that is built and stored on the search head. This summary index is populated by a scheduled search that is based on the search that you'd like to run faster.”
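The idea behind a summary index can be sketched in a few lines of Python. This is only a toy model of the concept, not how Splunk implements it: the event layout, the "status" field, and the function names build_summary and report_from_summary are all assumptions made for illustration. A "scheduled" step pre-aggregates raw events into a much smaller summary, and the report then reads only the summary.

```python
from collections import Counter, defaultdict


def build_summary(events, bucket_seconds=3600):
    """Pre-aggregate raw events into (time bucket, status) counts.

    Stands in for the scheduled search that populates a summary index.
    """
    summary = Counter()
    for ev in events:
        # Round the timestamp down to the start of its bucket.
        bucket = ev["_time"] - ev["_time"] % bucket_seconds
        summary[(bucket, ev["status"])] += 1
    return summary


def report_from_summary(summary):
    """Answer 'count by status' from the summary, never touching raw events."""
    totals = defaultdict(int)
    for (_bucket, status), count in summary.items():
        totals[status] += count
    return dict(totals)


# Hypothetical raw events: epoch-style timestamps and an HTTP status field.
raw_events = [
    {"_time": 0, "status": "200"},
    {"_time": 10, "status": "500"},
    {"_time": 3700, "status": "200"},
    {"_time": 3800, "status": "200"},
]

summary = build_summary(raw_events)
print(report_from_summary(summary))
```

The report is only as fresh as the last run of build_summary, which mirrors the trade-off of summary indexing: faster reports in exchange for results that lag behind the raw data.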

Report Acceleration

Report acceleration was introduced in Splunk 5.0 and is the simplest of the search acceleration methods: “setting it up is as easy as clicking a checkbox and setting a time range. Future runs of the search should run faster as long as they're run (at least partially) within this time range.” Report Acceleration also automatically shares its summaries with similar searches, automatically backfills gaps in data, and stores the summaries alongside the buckets in your indexes. However, it only works for searches that qualify: the search must use a transforming command, and any commands before it must be streaming. Whereas summary indexes are populated at scheduled intervals, the acceleration summaries are updated almost continuously.

TSIDX Search (TSTATS)

The other option for faster searching is still not officially supported by Splunk, yet the mechanism behind it is used every time you run a search: searching time series index files, or tsidx files. As Splunk indexes your data over time, it creates multiple tsidx files. These files have the .tsidx extension and are stored in buckets alongside the corresponding raw data files, which are broken into events based on timestamps. The tsidx files register all of the keywords in your data (error codes, response times, and so on), and each keyword is paired with a set of location references to the raw data events that contain it. When you run a search, Splunk looks up the keywords in the tsidx files and retrieves the associated events from the referenced raw data file.
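The keyword-to-location mapping described above is essentially an inverted index. The following Python sketch shows the general technique under simplifying assumptions: events are plain strings, keywords are whitespace-separated tokens, and a "location" is just a list position. The names build_index and search are hypothetical, and the real tsidx format is considerably more involved.

```python
from collections import defaultdict


def build_index(raw_events):
    """Map each keyword to the set of event positions that contain it."""
    index = defaultdict(set)
    for offset, line in enumerate(raw_events):
        for keyword in line.lower().split():
            index[keyword].add(offset)
    return index


def search(index, raw_events, *keywords):
    """Return events containing ALL keywords, using only index lookups
    to decide which raw events to fetch."""
    if not keywords:
        return []
    postings = [index.get(k.lower(), set()) for k in keywords]
    hits = set.intersection(*postings)
    return [raw_events[i] for i in sorted(hits)]


# Hypothetical web-access events, for illustration only.
raw_events = [
    "GET /cart status=200 time=12ms",
    "GET /cart status=500 time=900ms",
    "POST /login status=500 time=30ms",
]

index = build_index(raw_events)
print(search(index, raw_events, "status=500", "/cart"))
```

The search never scans the raw events; it intersects the location sets for each keyword and fetches only the matching positions, which is why a smaller index translates directly into faster searches.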

Using the tscollect command, it is possible to create tsidx files that contain only the fields you are interested in. Because these files contain only the specified fields and are stored on the search head, this is the fastest of the acceleration options. The trade-off is that tstats supports only a subset of the regular search commands. And, as with summary indexes, the results of a tstats search are only as current as the last time the tscollect namespace was built.

How to search?

Searching through tsidx files that you create involves two simple steps.

Step 1: Use the tscollect command to create a namespace. A namespace is the datacube that you define so you can search inside it. It will need to contain all the information to create the desired search result.
The default location for namespaces is the tsidxstats directory under $SPLUNK_DB. An alternative location can be set in the [default] stanza of indexes.conf:

[default]
tsidxStatsHomePath = <path on server>

TSCOLLECT:
Use tscollect to create the *.tsidx files:

tscollect [namespace=<string>] [squashcase=<bool>] [keepresults=<bool>]

For example:

index=twitter | table _time hashtag location mention retweet_count user | tscollect namespace=twitter squashcase=true

Here squashcase=true converts the field values to lowercase as they are written to the tsidx files.
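To make the two-step idea concrete, here is a toy Python model of what the example above does conceptually: keep only the selected fields, lowercase the values when squashcase is on, and then aggregate over the reduced data. It is a sketch under stated assumptions, not Splunk's implementation; collect_namespace and count_by are hypothetical names, and the sample tweets are made up.

```python
from collections import Counter


def collect_namespace(events, fields, squashcase=False):
    """Keep only the requested fields from each event (the 'namespace').

    With squashcase=True, string values are lowercased so that later
    grouping treats 'Splunk' and 'splunk' as the same value.
    """
    rows = []
    for ev in events:
        row = {}
        for field in fields:
            if field in ev:
                value = ev[field]
                if squashcase and isinstance(value, str):
                    value = value.lower()
                row[field] = value
        rows.append(row)
    return rows


def count_by(namespace, field):
    """Count rows grouped by one field, reading only the reduced namespace."""
    return Counter(row[field] for row in namespace if field in row)


# Hypothetical tweets; the bulky 'text' field is deliberately left out
# of the namespace.
events = [
    {"hashtag": "Splunk", "user": "alice", "text": "long tweet body ..."},
    {"hashtag": "splunk", "user": "bob", "text": "another long body ..."},
    {"hashtag": "BigData", "user": "alice", "text": "more text ..."},
]

twitter_ns = collect_namespace(events, ["hashtag", "user"], squashcase=True)
print(count_by(twitter_ns, "hashtag"))
```

Any field that was not collected into the namespace (here, text) cannot be queried later, which is why the namespace must contain everything the final search needs.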

Step 2: Use the tstats command to search the namespace. In the command, you will be calling on the namespace you created, and the fields in which you are interested for that particular search.

TSTATS:
Use tstats to search in the namespace created

– Begin searches with | tstats

– SQL-like queries

– Not all search commands are available

– Supports in-command filtering via WHERE and grouping via GROUPBY

– Supports stats subcommands with the “prestats=t” qualifier:

| tstats prestats=true ... | <stats|chart|timechart> ...

– Except when using prestats=t and append=t, tstats must be the first command in a search

The general form is:

| tstats <aggregate> from <namespace> [where <exact_query>] [groupby <field-list> [span=<timespan>]]

So yes, there is a way to speed things up; it just depends on what exactly you are trying to do. Once you have assessed your environment and how your data is used, it is fairly simple to set up a tsidx datacube and search only those files, with results that can come back 60 to 100 times faster than a regular search.

I hope you will find this useful in using tstats searches. If you need any help, reach out to us at Function1!

Tags: search, speed, Splunk, tsidx
