Expand your Toolkit: Troubleshooting Splunk from the GUI with REST API


Recently, I fixed a malfunctioning scheduled email from Splunk. The email delivered a PDF of a dashboard, known as a “scheduled PDF.” The dashboard wasn’t documented and its creator had recently left the company, so I was tasked with investigating something unfamiliar. The email was a 24-hour snapshot used by executives. In short, I had little insight into a high-visibility problem that needed analysis and a resolution.
 

For this investigation, I used Splunk’s REST API from the GUI, alongside the internal logs, to review configuration attributes. With this approach, the properties of knowledge objects and the settings in configuration files can be viewed in the same way, from the search bar. It lessens the need to open files on the filesystem, which should speed up investigations, and it works for those who don't have filesystem access.
 

I addressed the issue by completing the following steps:

  • Investigating the error

  • Reviewing configuration file attributes

  • Investigating the source and resolution

  • Testing the approach

Investigating the Error
Two errors revealed the trigger of the problem:

  1. Splunk's Python log showed an error message at the time of the email:

ERROR     __init__:477 - Socket error communicating with splunkd (error=The read operation timed out)

  2. The access log also showed that PDF generation took 110 seconds:

127.0.0.1 - 99999 [01/Feb/2017:06:00:54.080 -0500] "GET /services/pdfgen/render?now=1484910000&owner=99999&sid=scheduler__99999_… &namespace=snapshot_app&input-dashboard=snapshot_supplement&paper-size=letter HTTP/1.0" 400 320 - - - 110415ms
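The comparison between the logged duration and the timeout can be scripted. The following is a minimal sketch, assuming the duration is always the final field of the access-log line and ends in "ms", as it does in the line above. The function names are hypothetical, and the sample line abbreviates the query string.

```python
import re

# Assumption: the request duration is the last field on the line and is
# expressed in milliseconds with an "ms" suffix, as in the quoted log line.
DURATION_RE = re.compile(r"(\d+)ms\s*$")


def render_seconds(log_line: str) -> float:
    """Return the request duration from an access-log line, in seconds."""
    match = DURATION_RE.search(log_line)
    if match is None:
        raise ValueError("no trailing duration field found")
    return int(match.group(1)) / 1000.0


def exceeds_timeout(log_line: str, timeout_seconds: float) -> bool:
    """True when the logged request ran longer than the given timeout."""
    return render_seconds(log_line) > timeout_seconds


# Sample line based on the log above; the query string is shortened.
line = ('127.0.0.1 - 99999 [01/Feb/2017:06:00:54.080 -0500] '
        '"GET /services/pdfgen/render?... HTTP/1.0" 400 320 - - - 110415ms')

print(render_seconds(line))       # 110.415
print(exceeds_timeout(line, 30))  # True
```

A check like this makes it easy to scan many pdfgen requests for ones that ran past a given timeout.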
 

Configuration Attribute Review

I needed to check whether the splunkd connection timeout was less than 110 seconds. Web.conf holds the splunkdConnectionTimeout attribute. I used the Admin manual to check the default in web.conf.spec and used the REST API to check the value set on the system. The splunkdConnectionTimeout was still at the 30-second default, so the 110-second PDF generation was causing the error.

 

I reviewed the conf-file setting with the following commands:

web.conf

| rest /servicesNS/-/-/configs/conf-web/settings

splunkdConnectionTimeout=30
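The same endpoint can also be read outside the search bar. The following is a minimal sketch, assuming splunkd listens on the default management port 8089 and that the JSON response nests settings under entry and then content. The host name, function names, and sample payload are hypothetical, and the sample is hand-written rather than captured from a server.

```python
import json
from urllib.parse import urlencode


# Assumptions: "localhost" and port 8089 are placeholders for the splunkd
# management interface; adjust both for a real deployment.
def web_settings_url(host: str = "localhost", port: int = 8089) -> str:
    """Build the URL for the web.conf [settings] stanza, asking for JSON."""
    query = urlencode({"output_mode": "json"})
    return f"https://{host}:{port}/servicesNS/-/-/configs/conf-web/settings?{query}"


def connection_timeout(response_body: str) -> int:
    """Extract splunkdConnectionTimeout from a JSON REST response body.

    Assumes the response carries an "entry" list whose first item holds
    the stanza attributes in a "content" object.
    """
    payload = json.loads(response_body)
    content = payload["entry"][0]["content"]
    return int(content["splunkdConnectionTimeout"])


# Hand-written sample shaped like the assumed response; not real server output.
sample = json.dumps(
    {"entry": [{"name": "settings",
                "content": {"splunkdConnectionTimeout": "30"}}]}
)

print(web_settings_url())
print(connection_timeout(sample))  # 30
```

The sketch only builds the URL and parses a response. Sending the request would also need credentials and certificate handling, which depend on the environment.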
 

I had a few options at this point: I could adjust the timeout attribute or dig into the source for the issue. I don’t believe in adjusting a global setting to fix a local problem, so I had more work to do.

 

Investigating the Source and a Design Pattern Resolution

Because I opted to keep the global limits the same, I needed to address the local issue: the time it took to run the searches powering this specific dashboard. Splunk has several options for managing how long dashboards take to load. Splunk's Dashboards and Visualizations manual states "Use scheduled reports for dashboard panels when possible." When a dashboard references a scheduled report, Splunk returns the results of the most recent scheduled run instead of running a new search (as it would for an ad hoc search). Because the results are already saved, the search has negligible impact on the timeout window.

 

This dashboard had 5 long-running searches written inline in the dashboard, so each one had to run and complete before the PDF could be generated. I decoupled the searches from the dashboard and PDF generation by turning each search into a scheduled report. I scheduled the PDF delivery for 5 minutes after the reporting period. Even without accelerating the reports, I was now under the timeout. I added report acceleration as well, because the searches used transforming commands.

 

With this decoupled approach, the time to run the dashboard and generate the PDF dropped from more than 100 seconds to 4 seconds. Only the PDF generation now has to fit in the 30-second window. This change let me avoid altering a global property and the unintended consequences that could follow.

Test and Check Attributes
Next, I rescheduled the email to test both generation and delivery. I used the REST API's scheduled views endpoint to check that my dashboard was scheduled appropriately:

| rest /servicesNS/-/-/scheduled/views/{name}

I also reviewed the dashboard properties via the REST API:

| rest /servicesNS/-/-/data/ui/views/{name}
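When the {name} placeholder is filled in programmatically, the view name should be URL-encoded. The following is a minimal sketch, assuming the two endpoint paths above. The helper name is hypothetical, and the dashboard name is taken from the input-dashboard parameter in the access log purely for illustration.

```python
from urllib.parse import quote

# Endpoint templates copied from the searches above.
SCHEDULED_VIEW = "/servicesNS/-/-/scheduled/views/{name}"
DASHBOARD_VIEW = "/servicesNS/-/-/data/ui/views/{name}"


def endpoint(path_template: str, name: str) -> str:
    """Fill in {name}, percent-encoding characters that are unsafe in a URL path."""
    return path_template.format(name=quote(name, safe=""))


# Illustrative name only; the actual entry name on a given system may differ.
print(endpoint(SCHEDULED_VIEW, "snapshot_supplement"))
print(endpoint(DASHBOARD_VIEW, "my view"))  # /servicesNS/-/-/data/ui/views/my%20view
```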

Conclusion
This simple yet important email turned out to be more involved than expected. My initial goal was to avoid a timeout, but given the audience, I opted for a robust solution. Every system needs updates and revisions eventually, and a fix like this is a good opportunity to look a bit deeper and make improvements that pay off later.

Resources

http://docs.splunk.com/Documentation/Splunk/latest/Viz/DashboardPDFs

http://docs.splunk.com/Documentation/Splunk/latest/Admin/Webconf

http://docs.splunk.com/Documentation/Splunk/latest/RESTREF/RESTknowledge

http://docs.splunk.com/Documentation/Splunk/latest/Report/Acceleratedrep...
