Anonymizing Data in Splunk

By: Anshu June 02, 2015

Introduction

In this post we'd like to discuss masking or obscuring data in Splunk. Customers have asked us in the past how to mask data at both search-time and index-time. Usually the goal is to hide personally identifiable information for security, compliance, or both. We’ll cover several different approaches for doing this in Splunk and discuss some pros and cons.

For each of the approaches we will use the following sample data from a fictitious HR application:

sourcetype = hr_app
sample event = “This is an event with a sensitive number in it. SN=111-11-1111. This number should be masked”

Transforms.conf

In this approach, a TRANSFORMS setting in the props.conf file calls a stanza in transforms.conf, and the transformation is applied to the data in the queues before it is indexed. In the example, the goal is to mask the “sensitive number” except for the last 4 digits.

---props.conf---
[hr_app]
TRANSFORMS-hr_app_logs_mask_data = mask_sn

---transforms.conf---
[mask_sn]
REGEX = (?m)^(.*)SN=\d{3}-\d{2}-(\d{4}.*)
DEST_KEY = _raw
FORMAT = $1SN=###-##-$2

This is the result of the sample event going through the transformation:
“This is an event with a sensitive number in it. SN=###-##-1111. This number should be masked”

The approach here is to capture the first part of the event with (.*), match the part to be masked (SN= and the first five digits) without capturing it, then capture the last 4 digits and the rest of the event. Only the two captured parts are retained when the event data is written back out to the "_raw" key specified by "DEST_KEY". The "FORMAT" setting specifies how the event will be rewritten; "$1" and "$2" refer to the two capturing groups in the "REGEX" setting.
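Before deploying a transform like this, it can help to try the regular expression outside Splunk. The following is a minimal Python sketch, not Splunk code. It assumes Python's re module behaves the same as Splunk's regex engine for this particular pattern, and the function name mask_event is our own, made up for illustration.

```python
import re

# Same pattern as the REGEX setting in the transforms.conf stanza above.
MASK_SN = re.compile(r"(?m)^(.*)SN=\d{3}-\d{2}-(\d{4}.*)")


def mask_event(raw: str) -> str:
    """Emulate FORMAT = $1SN=###-##-$2 being written back to _raw."""
    match = MASK_SN.search(raw)
    if match is None:
        # Events that do not match the pattern pass through unchanged.
        return raw
    # group(1) is everything before "SN=", group(2) is the last 4 digits
    # plus the rest of the event.
    return f"{match.group(1)}SN=###-##-{match.group(2)}"


sample = ("This is an event with a sensitive number in it. "
          "SN=111-11-1111. This number should be masked")
print(mask_event(sample))
```

One thing the sketch makes easy to check: because the leading (.*) is greedy, an event that contains several SN= values has only the last one masked here. The earlier ones end up inside the first capturing group and are written back untouched.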

SEDCMD

Splunk exposes a SEDCMD setting in props.conf that applies sed-style substitutions at index-time. Taking the example above, the following would work:

---props.conf---
[hr_app]
SEDCMD-hr_app_logs_mask_data = s/SN=\d+-\d+-(\d+)/SN=###-##-\1/

This will produce the same result as above. There are a few advantages to this approach:

It only involves one configuration file (props.conf) instead of two.
The matching expression is simpler and doesn’t need to match the entire event like the one in transforms.conf does.
Multiple expressions can be chained together.
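To see what the sed expression does, and what chaining several expressions looks like, here is a rough Python emulation. It is only a sketch under a few assumptions: Python's re.sub stands in for Splunk's sed-style substitution, the helper apply_sed_rules is a name we made up, and the second rule (a PIN= field) is hypothetical and not part of the HR sample data.

```python
import re

# Each tuple mirrors one s/pattern/replacement/ expression.
SED_RULES = [
    # The expression from the SEDCMD setting above.
    (r"SN=\d+-\d+-(\d+)", r"SN=###-##-\1"),
    # Hypothetical second expression, only here to illustrate chaining.
    (r"PIN=\d+", "PIN=####"),
]


def apply_sed_rules(raw: str, rules=SED_RULES) -> str:
    """Apply each substitution in order, as a chain of sed expressions would."""
    for pattern, replacement in rules:
        # count=1 mimics a sed expression without the trailing g flag:
        # only the first match in the event is replaced.
        raw = re.sub(pattern, replacement, raw, count=1)
    return raw


print(apply_sed_rules("User login. SN=111-11-1111 PIN=1234"))
```

Unlike the transforms.conf pattern, the expression only has to describe the sensitive value itself, so the rest of the event is left alone without needing to be captured.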

SCRUB

The “scrub” command is an interesting one. It anonymizes data at search-time based on configuration that is shipped with Splunk. Note that this command only takes effect at search-time, so any sensitive data is still stored on disk, “at-rest” on the indexer. The cool part about this feature is that it can use an existing dictionary of terms to anonymize data while keeping the format of the data intact. For example, an e-mail address like "john@mail.com" would become "joe@abc.com". The latter is not a real address but retains the format of an e-mail address.

Reference Link: http://docs.splunk.com/Documentation/Splunk/6.2.3/SearchReference/Scrub

ENCRYPTION/DECRYPTION

Finally, another option we found interesting is an app on Splunkbase to encrypt and decrypt data. In this approach, existing data in Splunk is encrypted and re-indexed, then decrypted at search-time.

Link to the app on Splunkbase: https://splunkbase.splunk.com/app/282/

Hopefully this helps shed some light on the options for anonymizing or masking data in Splunk. Thanks for reading!

Tags: Splunk, Mask, Anonymizing, Data, Search-time, Index-time
