All Aboard! On-boarding Data Sources into Splunk

By: Anshu October 20, 2015

Introduction

An initial Splunk deployment is like a small train. Its engine is complex and powerful, but it’s only pulling a couple of train cars. As a Splunk deployment grows, more data has to be on-boarded into Splunk, just as cars are added to a train. This post is relevant to organizations that want to develop more mature processes and governance around their Splunk deployment. It is written from the perspective of the group in charge of on-boarding data into Splunk.

A typical scenario we see in the field is a deployment that is moving beyond the proof of concept phase and being adopted as a departmental or enterprise-wide data platform. This requires working with different groups within the organization, mainly system and application owners, in order to on-board data into Splunk. These groups are like product manufacturers that wish to ship goods via train. Defining a process for on-boarding data leads to efficiency, since the information and actions needed from data owners are known upfront and the bottlenecks are clear.

The process below assumes that the need for a data source in Splunk has been vetted by the gatekeeper or “conductor” of the Splunk deployment. The process is also tailored towards on-boarding data via Splunk forwarders. On-boarding data from appliances is similar but does introduce some differences.

Step 1: Kick-off Meeting with Stakeholders

The first step in on-boarding data into Splunk is to have a kick-off meeting with the data owners. During this meeting the following topics should be covered:

The organization’s process for on-boarding data into Splunk
The information required from the data owners (detailed in Step 2)
The actions required from the data owners

Step 2: Collect the Necessary Information

There are several pieces of information that are needed in order to on-board data into Splunk. Each piece of information starts a sub-process in getting the data into Splunk. Below is a list of that information:

Data Input 1: Samples of all Data Sources to be Ingested

Ask the data owners to provide samples of all data sources to be ingested. Samples are necessary so that index-time configuration (e.g. timestamp extraction and event-breaking) and search-time configuration (e.g. field extractions and tagging) can be created. Note that it may not be necessary to get samples of well-known data sources such as Windows Security events, since there are add-ons available on splunkbase.com that can parse these data sources.

Data Input 2: List of all Log File Paths and Log File Names to Collect

The data owners should provide this information in order to create the inputs configuration package that will be deployed to the Splunk forwarders to collect this data.

Data Input 3: Identify Systems to be Used for Testing

The data owners should identify test systems that will be used to verify configuration and connectivity to the Splunk environment.

Data Input 4: List of All Servers to Collect Data From

The data owners should provide this list so that Splunk forwarders can be installed and added under the control of the Splunk deployment server. This will also help validate that data is coming in from all servers after forwarders have been configured.
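As a rough illustration of that validation step, the sketch below compares the server list from the data owners with the hosts actually seen in Splunk. Everything in it is an assumption for illustration: the host names are made up, the list of reporting hosts is assumed to have been exported from a search over the target index, and the helper name `missing_hosts` is hypothetical.

```python
def missing_hosts(expected, reporting):
    """Return expected servers that have not reported any data.

    Names are compared case-insensitively on the short host name, because
    the owners' list and Splunk's host field often differ in case or domain.
    """
    def short(name):
        return name.strip().lower().split(".")[0]

    seen = {short(host) for host in reporting}
    return sorted(host for host in expected if short(host) not in seen)


# Hypothetical data: Data Input 4 versus hosts exported from Splunk.
expected_servers = ["web01.example.com", "web02.example.com", "db01.example.com"]
reporting_hosts = ["WEB01", "db01.example.com"]

print(missing_hosts(expected_servers, reporting_hosts))
# prints ['web02.example.com']
```

Any server left in the output either has no forwarder yet or is not sending data, and is worth following up with the data owners.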

Step 3: Sub-Processes

What we discovered during the development of this process is that each of these data inputs is used to begin a sub-process for on-boarding the data. The advantage here is that not all of the information has to be gathered before on-boarding work can start, and the sub-processes can be worked on simultaneously by different people.
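One way to picture this is as a small dependency table. The sketch below is only an illustration of the relationships described in this post: the mapping of sub-processes to data inputs and the two deployment gates are read off Steps 2 and 3, while the function names and data structures are hypothetical.

```python
# Data input (from Step 2) that lets each sub-process begin.
REQUIRED_INPUT = {"A": 1, "B": 2, "C": 3, "D": 4}

# Sub-processes that must be complete before the deployment step of another.
DEPLOY_GATE = {"C": {"A", "B"}, "D": {"C"}}


def startable(received_inputs):
    """Sub-processes that can begin with the inputs received so far."""
    return sorted(sp for sp, needed in REQUIRED_INPUT.items()
                  if needed in received_inputs)


def can_deploy(sub_process, completed):
    """True when every sub-process gating this one's deployment is done."""
    return DEPLOY_GATE.get(sub_process, set()) <= set(completed)


print(startable({1, 3}))       # log samples and test systems have arrived
print(can_deploy("C", {"A"}))  # B is not finished yet
```

With only the log samples and test systems in hand, work on Sub-Process A and the forwarder installation in Sub-Process C can already begin, even though the inputs package cannot be deployed yet.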

Sub-Process A: Create Data Source Configuration

This sub-process requires the log samples (Data Input 1 from above). During this phase, Splunk configuration is created in a development environment.

The log samples are analyzed to determine sourcetype, timestamp, and event boundaries. Sourcetypes are key parts of data source configuration, and well-known data sources should use standard sourcetypes. One way to discover these is to search Splunkbase for an appropriate add-on and examine which sourcetypes it uses.
Index-time configuration is created to extract timestamps and define event boundaries. Although Splunk can do this automatically for many data sources, it’s best practice to define this configuration explicitly so that data is indexed more efficiently and with fewer errors.
Search-time configuration is developed. Although it’s not mandatory to do so during this phase, developing some initial field extractions or tags will help ensure that the data is more usable once it becomes available to users.
The configuration package created or validated during this process is deployed to the Test Splunk Environment.
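The index-time part of this sub-process can be sketched as follows. This is a local approximation, not how Splunk itself parses data: the sample events, the sourcetype name `acme:app:log`, and the helper functions are all hypothetical. The setting names (`TIME_PREFIX`, `TIME_FORMAT`, `MAX_TIMESTAMP_LOOKAHEAD`, `SHOULD_LINEMERGE`, `LINE_BREAKER`) are props.conf settings, and the chosen time format happens to use directives that Python's `strptime` also understands, which is not true of every Splunk time format.

```python
import re
from datetime import datetime

# Hypothetical sample; real samples come from the data owners (Data Input 1).
SAMPLE = (
    "2015-10-20 14:03:11 INFO  Starting job 42\n"
    "2015-10-20 14:03:12 ERROR Job 42 failed\n"
    "java.lang.IllegalStateException: boom\n"
    "    at com.example.Job.run(Job.java:17)\n"
    "2015-10-20 14:03:15 INFO  Retrying job 42\n"
)

# Candidate explicit index-time settings for the sourcetype.
SETTINGS = {
    "TIME_PREFIX": "^",
    "TIME_FORMAT": "%Y-%m-%d %H:%M:%S",
    "MAX_TIMESTAMP_LOOKAHEAD": "19",
    "SHOULD_LINEMERGE": "false",
    "LINE_BREAKER": r"([\r\n]+)\d{4}-\d{2}-\d{2}\s",
}


def break_events(raw, line_breaker):
    """Approximate LINE_BREAKER: capture group 1 is the discarded delimiter."""
    events, start = [], 0
    for match in re.finditer(line_breaker, raw):
        events.append(raw[start:match.start(1)])
        start = match.end(1)
    tail = raw[start:].rstrip("\r\n")
    if tail:
        events.append(tail)
    return events


def timestamp_ok(event, time_format, lookahead):
    """Check that the event starts with a timestamp (TIME_PREFIX is '^' here)."""
    try:
        datetime.strptime(event[:lookahead], time_format)
        return True
    except ValueError:
        return False


def render_stanza(name, settings):
    """Render the settings as a props.conf stanza for the given sourcetype."""
    return "\n".join([f"[{name}]"] + [f"{k} = {v}" for k, v in settings.items()])


events = break_events(SAMPLE, SETTINGS["LINE_BREAKER"])
for event in events:
    ok = timestamp_ok(event, SETTINGS["TIME_FORMAT"],
                      int(SETTINGS["MAX_TIMESTAMP_LOOKAHEAD"]))
    print(ok, repr(event[:40]))
print(render_stanza("acme:app:log", SETTINGS))
```

The stack trace stays attached to the ERROR event because the breaker only splits before a line that begins with a date. The final check still belongs in the development Splunk environment, since Splunk's own parser is the authority.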

Sub-Process B: Create Inputs Configuration Package

Create the inputs configuration package using the list of log files to collect (Data Input 2 from above).

Determine what sourcetype to set for the data at input time. (See the first step of Sub-Process A above.)
Determine which index the data will be sent to. (You can refer to one of our previous blog posts covering this topic: http://www.function1.com/2012/09/organizing-your-splunk-shoe-rack-defini...)
Create the actual inputs.conf configuration. This will include the log file path(s) to monitor, index, and sourcetype.
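A minimal sketch of that last step, assuming every file in the list shares one index and one sourcetype. The paths, the index name `acme_app`, the sourcetype `acme:app:log`, and the helper `render_inputs` are hypothetical; the `[monitor://...]` stanza with `index`, `sourcetype`, and `disabled` is the standard inputs.conf form for file monitoring.

```python
def render_inputs(log_paths, index, sourcetype):
    """Render one inputs.conf monitor stanza per log file path."""
    stanzas = []
    for path in log_paths:
        stanzas.append(
            f"[monitor://{path}]\n"
            f"index = {index}\n"
            f"sourcetype = {sourcetype}\n"
            "disabled = false"
        )
    return "\n\n".join(stanzas) + "\n"


# Hypothetical values standing in for Data Input 2 and the decisions above.
print(render_inputs(
    ["/var/log/acme/app.log", "/var/log/acme/audit.log"],
    index="acme_app",
    sourcetype="acme:app:log",
))
```

Generating the file from the owners' list keeps the package in step with what was actually requested, and makes it easy to regenerate when the list changes.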

Sub-Process C: Collect Data from the Test System

Install the Splunk universal forwarder on the Test servers.
Configure the forwarder to communicate with the appropriate Splunk Deployment Server.
Configure the forwarder to send data to the indexer(s) in the Test environment. This is done by deploying a configuration package containing the list of indexers in outputs.conf.
(After Sub-Process A and B are completed) Deploy the inputs configuration package to the Test system.
Validate that data is being indexed correctly and that the search-time configuration is accurate. If problems are found, go back to Sub-Process A for remediation.
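The two forwarder configuration steps above might look roughly like the following. The host names, the group name `test_indexers`, and both helper functions are hypothetical, and the ports (9997 for receiving, 8089 for management) are only the commonly used defaults, so check what your environment actually uses.

```python
def render_outputs(group, indexers, port=9997):
    """Render an outputs.conf that sends data to the listed indexers."""
    servers = ", ".join(f"{host}:{port}" for host in indexers)
    return (
        "[tcpout]\n"
        f"defaultGroup = {group}\n"
        "\n"
        f"[tcpout:{group}]\n"
        f"server = {servers}\n"
    )


def render_deployment_client(deployment_server, port=8089):
    """Render a deploymentclient.conf pointing at the deployment server."""
    return (
        "[deployment-client]\n"
        "\n"
        "[target-broker:deploymentServer]\n"
        f"targetUri = {deployment_server}:{port}\n"
    )


# Hypothetical Test environment hosts.
print(render_outputs("test_indexers", ["idx-test01.example.com",
                                       "idx-test02.example.com"]))
print(render_deployment_client("ds-test.example.com"))
```

Keeping the indexer list in its own configuration package means the Test and Production environments differ only in which package the deployment server hands out.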

Sub-Process D: Collect Data from Production Systems

Install the Splunk universal forwarder on the Production servers. The effort here depends on the number of servers to deploy forwarders to and on the organization’s change management process.
Configure the forwarder to communicate with the appropriate Deployment Server.
Configure the forwarder to send data to the indexer(s) in the Production environment.
(After Sub-Process C is completed) Deploy the data source configuration from Sub-Process A onto the production Splunk indexers and search heads.
Deploy the inputs configuration package from Sub-Process B to a subset of the Splunk forwarders residing on the Production systems.
Validate data in the Production environment.
Deploy the inputs configuration to the remaining Production servers from which data will be collected.
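The phased rollout in the last three steps can be planned with a small helper such as the one below. It is a sketch under assumptions: the server names are invented, `split_rollout` is a hypothetical helper, and taking the first names alphabetically is only a placeholder for however your organization chooses a representative pilot group.

```python
def split_rollout(servers, pilot_count):
    """Split the Production server list into a pilot subset and the rest."""
    ordered = sorted(set(servers))      # de-duplicate, stable order
    pilot = ordered[:pilot_count]
    remainder = ordered[pilot_count:]
    return pilot, remainder


# Hypothetical Production list (Data Input 4).
production = ["app03", "app01", "app02", "app05", "app04", "app01"]
pilot, remainder = split_rollout(production, pilot_count=2)

print("Deploy inputs first to:", pilot)
print("After validation, deploy to:", remainder)
```

The two lists can then be used to target the inputs configuration package at the pilot group first and at the remaining servers once the Production data has been validated.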

Conclusion

Working with different groups within an organization to on-board data into Splunk can be challenging. Defining a repeatable process to do so can help make the ride smoother for all involved.

Tags: Splunk, Governance, Data On-boarding