An introduction to Splunk Search Processing Language
Splunk offers an expansive processing language that lets a user reduce and transform large amounts of data from a dataset into specific, relevant pieces of information. The Search Processing Language (SPL) is vast, with a wide selection of search commands for many different jobs.
My goal in this blog is to introduce you to the basic SPL format and to the different types of Splunk search commands. I also aim to help you decide which type of command best suits the problem you are facing.
Anatomy of a Search
Search Pipeline
The “search pipeline” refers to the structure of a Splunk search, which consists of a series of commands delimited by the pipe character (|). The pipe character passes the results of one command as input to the next, chaining SPL commands together.
Generally, a search is composed of commands piped into one another, each step reducing and shaping the results into what we want.
A Splunk search starts with search terms at the beginning of the pipeline. These search terms are keywords, phrases, Boolean expressions, key/value pairs, etc. that specify which events you want to retrieve from the index(es).
The retrieved events can then be passed as input to a search command using a pipe character, and that command transforms them into the results that you need. At the beginning of a search pipeline, the search command is implied even when you don’t state it explicitly. So if you start a search with host="localhost", it is interpreted as search host="localhost".
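As a minimal sketch of the pipeline described above, the two searches below are meant to be equivalent. The first relies on the implied search command and the second spells it out. The keyword error and the grouping by source are illustrative assumptions.

```spl
host="localhost" error
| stats count by source
```

```spl
search host="localhost" error
| stats count by source
```

In both cases the search terms retrieve events first, and the pipe hands those events to stats for counting.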
Fields
Events and results flowing through the search pipeline exist as a collection of fields, which ultimately come from the data. Fields contain value strings relevant to specific events and can be used alongside search commands to filter the data. Fields can come from the index or from a wide range of sources at search time, such as tags, regex extractions, event types, etc. For a given event, a field name might be present or absent; if present, it might contain a single string value or multiple string values.
Some important default fields are index, _time, host, source, and _raw.
Some notable terms for describing fields and their values are:
Null field: A field that is not present on a particular result or event. Other events or results in the same search might have values for this field.
Empty field: A field that contains a single value that is the empty string.
Empty value: A value that is the empty string, or “”. You can also describe this as a zero-length string.
Multivalue field: A field that has more than one value. All non-null fields contain an ordered list of strings. The common case is that this is a list of one value. When the list contains more than one entry, it is a multivalue field.
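The distinctions above can be probed with a small self-contained search. This is only a sketch: the field names empty_field, multi_field, and missing_field are made up for illustration, and missing_field is deliberately never created so that it is a null field.

```spl
| makeresults
| eval empty_field=""
| eval multi_field=split("red,green,blue", ",")
| eval missing_is_null=if(isnull(missing_field), "yes", "no")
| eval empty_is_null=if(isnull(empty_field), "yes", "no")
| eval empty_length=len(empty_field)
| eval value_count=mvcount(multi_field)
| table missing_is_null, empty_is_null, empty_length, value_count
```

The makeresults command creates one result to work on. The isnull checks contrast a field that is absent with one that holds the empty string, and mvcount reports how many values the multivalue field holds.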
Quotes and Escape Characters
Quotes are used when a whole string must be evaluated as a single unit. You need quotes around phrases and field values that include white space, commas, pipes, quotes, or brackets. Quotes must be balanced. The escape character (\) is used to escape quotes, pipes, and itself so that they are treated literally.
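A short sketch of quoting and escaping inside eval strings follows. The field names phrase and path and their values are assumptions chosen only to show a quoted pipe, escaped quotes, and escaped backslashes.

```spl
| makeresults
| eval phrase="status=\"ok\" | next"
| eval path="C:\\temp\\logs"
| table phrase, path
```

Because the pipe sits inside balanced quotes, it is part of the string rather than a command separator. The backslashes keep the inner quotes and the literal backslashes from being interpreted.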
General SPL Components
When writing an SPL search, there are a few components that can be used to filter or format the results. Most searches combine several of the components below.
Search Terms
The search terms are keywords or phrases that filter what we want in our results. Search terms can be field names and values we want to match, the indexes we are interested in, or criteria that need to be met.
Commands
Commands are actions you want to take on the results, such as formatting, filtering, altering, sorting, counting, renaming, or generating them. There is a wealth of search commands to choose from, and more are discussed in the rest of this blog.
Functions
Along with commands, search functions specify what computation is done on certain fields. Functions are usually used alongside statistical commands, such as stats. Some examples of functions include: avg(), sum(), median(), min(), max(), mean(), var().
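As a small sketch of functions used with stats, the search below builds five results numbered 1 to 5 and summarizes them. The field name n and the output names are illustrative.

```spl
| makeresults count=5
| streamstats count AS n
| stats avg(n) AS average, sum(n) AS total, min(n) AS smallest, max(n) AS largest
```

For the values 1 through 5 this should yield an average of 3, a total of 15, a minimum of 1, and a maximum of 5.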
Clauses
Clauses help group, rename, or filter fields to shape the results. Some common clauses are the “BY” clause, which groups the results by a certain field, the “AS” clause, used for renaming, and the “WHERE” clause, used for filtering. The Boolean operators “AND” and “OR” are also useful for filtering; they are generally used with search terms to specify which terms must match. If no operator is given between two search terms, “AND” is implied.
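The sketch below shows BY and AS in a stats command, followed by the where command for filtering. It generates its own data, and the names n, parity, and total are assumptions for illustration.

```spl
| makeresults count=6
| streamstats count AS n
| eval parity=if(n % 2 == 0, "even", "odd")
| stats sum(n) AS total BY parity
| where total > 9
```

Grouping by parity gives totals of 9 for the odd numbers and 12 for the even ones, so the final filter should keep only the even row.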
Arguments
Splunk commands have arguments that are either optional or required. Required arguments are necessary for the command to work, and the command generally returns an error when one is not provided. Depending on the argument, it takes a field name, a value, or a Boolean value. Optional arguments often have default values that apply when no value is specified.
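The head command is a convenient illustration of an optional argument with a default. In this sketch, which generates twenty numbered results, limit=3 overrides the default.

```spl
| makeresults count=20
| streamstats count AS n
| head limit=3
```

Dropping limit=3 and writing just head would fall back to the default of 10 results.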
Sub-Searches
A subsearch runs its own search and returns the results to the parent command as the argument value. The subsearch runs before the parent command and is enclosed in square brackets. This type of search is generally used when you need to access more data or combine two different searches.
An example of a sub-search in a command is:
union [search index=a | eval type = "foo"] [search index=b | eval mytype = "bar"]
Example
The following example shows how we can use some of the different components and the anatomy we have previously talked about to make a search:
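A search that combines search terms, a Boolean operator, commands, a function, a clause, and an argument might look like the sketch below. It is an illustration only: the field names clientip and bytes are assumptions, and the indexes access_combined and main would need to exist in your environment.

```spl
index="access_combined" OR index="main"
| dedup clientip keepevents=true
| stats avg(bytes) by host
| head 5
```

Here the search terms select the indexes, dedup takes a keepevents argument, stats applies the avg() function with a by clause, and head trims the output.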
Some examples of the above components in this example are:
Search Terms: index="access_combined", index="main"
Clause: OR, by
Functions: avg()
Commands: stats, dedup, head
Argument: keepevents=true
Types of Commands
There are six different types of search commands that a user can use: distributable streaming, centralized streaming, transforming, generating, orchestrating, and dataset processing.
Distributable Streaming
A distributable streaming command runs on the indexer or the search head, depending on where in the search the command is invoked. When it runs on the indexers, it processes subsets of the indexed data in parallel, which speeds up execution considerably. Examples of distributable streaming commands include: convert, eval, fields, regex, and rename.
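A chain made up only of distributable streaming commands might look like the sketch below. The index web and the fields status, bytes, and clientip are hypothetical names used for illustration.

```spl
index=web status=500
| eval kb=bytes/1024
| rename clientip AS client
| fields _time, client, kb
```

Since each command here works on one event at a time, every step in this sketch is eligible to run on the indexers.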
Centralized Streaming
A centralized streaming command applies a transformation to each event returned by a search, but it runs only on the search head. Unlike a distributable streaming command, it cannot run on the indexers, so less parallelization is available to it. Examples of centralized streaming commands include: dedup, head, join, and transaction.
Transforming
A transforming command orders the results into a data table. These commands transform the specified values for each event into numerical values that Splunk software can use for statistical purposes. Transforming commands are required to turn search results into the data structures needed for visualizations such as charts and tables. Examples of transforming commands include: chart, timechart, stats, top, and rare.
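As a sketch of a transforming command, the search below turns events into a table of counts that a chart could be drawn from. The index web, the sourcetype access_combined, and the field status are assumed names.

```spl
index=web sourcetype=access_combined
| chart count over status by host
```

The output is no longer a list of events but a table with one row per status value and one column per host.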
Generating
A generating command is a command that generates events or reports, typically from the indexes, without transforming any prior results. Generating commands don’t expect or require an input and are usually invoked at the beginning of the search with a leading pipe. That is, no other command can be piped into a generating command. They are either event-generating (distributable or centralized) or report-generating. Depending on the command used, the results are returned as a list or a table. Examples of generating commands include: dbinspect, datamodel, inputcsv, metadata, pivot, and search.
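Two short sketches of generating commands with a leading pipe follow. The first fabricates three empty results, and the second lists sourcetype metadata for the _internal index, assuming you have access to that index.

```spl
| makeresults count=3
```

```spl
| metadata type=sourcetypes index=_internal
```

Neither search has anything before the first pipe, which is what marks these as the start of the pipeline.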
Orchestrating
An orchestrating command is one that does not directly affect the end result of the search but controls some aspects of how the search is processed. Orchestrating commands are generally used to help optimize the search so that it completes faster. Examples of orchestrating commands include: redistribute, noop, and localop.
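One use of noop that you may come across is switching off the built-in search optimizations for a single search while troubleshooting. The sketch below assumes access to the _internal index, and the stats step is only there to give the search something to do.

```spl
index=_internal
| stats count by sourcetype
| noop search_optimization=false
```

The results should be the same with or without the noop line, since it only influences how the search is processed.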
Dataset Processing
A dataset processing command is one that requires the entire dataset before the command can run. These commands are non-transforming, non-distributable, non-streaming, and non-orchestrating. Examples of dataset processing commands include: sort, eventstats, and some modes of cluster, dedup, and fillnull.
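The eventstats command shows why the whole dataset is needed: an overall average cannot be attached to any result until every result has been seen. In this sketch the names n, overall_avg, and above_avg are illustrative.

```spl
| makeresults count=4
| streamstats count AS n
| eventstats avg(n) AS overall_avg
| eval above_avg=if(n > overall_avg, "yes", "no")
```

With the values 1 to 4 the overall average is 2.5, so the results with n equal to 3 and 4 should be flagged as above average.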
Streaming Commands vs. Non-Streaming Commands
There are two ways that commands can ingest data: streaming the data, or waiting for the data to be fully available before operating on it. These two approaches divide commands into two categories, streaming search commands and non-streaming search commands. A streaming search command operates on each event as it comes in, and has one input and one or no outputs. This type of command can run on the indexers and can be applied to subsets of the indexed data in parallel, as long as it is not preceded by a non-streaming search command. Non-streaming search commands run on the search head and require that all of the events be gathered from the indexers before running. An example of a non-streaming search command is the “sort” command, which requires all of the data to be retrieved before it can be sorted correctly.
Tips, Tricks, and Best Practices
Knowing your search goals
Knowing which goal you want your search to accomplish can help you optimize it.
When a search retrieves raw events from an index, no additional processing of the events is done before they are retrieved, so being as specific as possible speeds up the search. You can do this with keywords and field-value pairs that are unique to the events. When you retrieve events that occur frequently, the search is referred to as a dense search; if the events are rare in the dataset, it is known as a sparse search. Sparse searches that run against large volumes of data take longer than dense searches, since it takes longer to find those events.
When running a search that generates a report that summarizes or organizes data, it is best to be restrictive and specific when retrieving data, since the data is stored and processed in memory.
Using non-streaming search commands as late as possible
Another way to speed up search execution is to consider where you place non-streaming search commands. Placing them as late as possible in your search string helps optimize the search. Commands that come before the first non-streaming search command can run on the indexers in parallel, so using a non-streaming command early reduces parallel processing. Since a non-streaming command requires all of the events to be present on the search head before operating on them, all of the data is sent to the search head, and every subsequent command that could have run on the indexers runs on the search head instead.
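The ordering advice can be sketched as follows, with the streaming work done first and sort left to the end. The index web and the fields status, bytes, and host are hypothetical.

```spl
index=web status=500
| eval kb=round(bytes/1024, 2)
| fields _time, host, kb
| sort - kb
```

Moving the sort line up to directly follow the search terms would describe the same result, but the eval and fields steps would then come after a non-streaming command.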
Limiting the time-range
Another way to speed up searches is to limit the time range to be as small as possible. This helps cut down on the number of events that need to be processed in the subsequent commands.
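Besides the time range picker, the time range can be narrowed inside the search itself with the earliest and latest time modifiers. This sketch assumes access to the _internal index and looks only at the last 15 minutes.

```spl
index=_internal earliest=-15m latest=now
| stats count by sourcetype
```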
Using fields filtering effectively
Using indexed and default fields to filter your data as early as possible helps speed up searches, since filtering out data early means that less data needs to be processed later in the pipeline.
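As a sketch of early filtering, the search below puts the default fields index, sourcetype, and host in the first segment and then keeps only the fields needed later. The values web, access_combined, and web01, and the fields status and uri, are assumptions.

```spl
index=web sourcetype=access_combined host=web01 status=404
| fields _time, uri, status
```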
Commonly Used Commands and Functions
Common Search Commands
Command | Description |
---|---|
dedup | Removes duplicate results that match the specified criteria |
eval | Calculates an expression and stores the result in a field; see the eval functions below |
fields | Keeps or removes the specified fields from the search results |
head/tail | Returns the first/last N results |
lookup | Adds field values from an external source such as a lookup table |
chart/timechart | Returns results in a tabular format suitable for charting, such as a time chart or bar chart |
rename | Renames a field; use wildcards to rename multiple fields |
rex | Extracts fields from results using regular expression named groups |
search | Filters results to those that match the search expression |
sort | Sorts the results by the specified fields, in ascending or descending order |
stats | Provides statistics, optionally grouped by fields; see the stats functions below |
top/rare | Displays the most/least common values of a field. Can be useful for grouping |
where | Filters search results using eval expressions. Can be used to compare two different fields |
table | Specifies fields to keep in the result set, and retains data in a tabular format |
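A few of the commands above can be combined in one self-contained sketch. The raw event text and the extracted field names user, action, and duration are made up for illustration.

```spl
| makeresults
| eval _raw="user=alice action=login duration=42"
| rex "user=(?<user>\w+) action=(?<action>\w+) duration=(?<duration>\d+)"
| where tonumber(duration) > 30
| table user, action, duration
```

Here rex pulls three fields out of _raw using named groups, where filters on the numeric value of one of them, and table keeps only the listed fields.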
Common Eval Functions
Function | Description |
---|---|
abs(x) | Returns absolute value of x |
case(x,"y",…) | Takes pairs of arguments X and Y, where the X arguments are Boolean expressions. Returns the Y argument that corresponds to the first X that evaluates to TRUE. |
ceil(x) | Ceiling of number x |
cos(x) | Cosine of x |
exact(x) | Evaluates an expression x using double precision floating point arithmetic, e.g. exact(3.14*num) |
exp(x) | Returns e raised to the power of x |
if(x,y,z) | If X evaluates to TRUE, the result is the second argument Y. If X evaluates to FALSE, the result is the third argument Z |
isbool(x) | Returns true if X is a boolean |
isint(x) | Returns true if X is an integer |
isnull(x) | Returns true if X is null |
isstr(x) | Returns true if X is a string |
len(x) | Returns the character length of X |
log(x,y) | Returns the logarithm of X using Y as the base. Y is optional and defaults to 10 |
match(x,y) | Returns TRUE if X matches the regex pattern Y |
max(x, y, …) | Returns maximum |
min(x, y, …) | Returns minimum |
md5(x) | Returns the MD5 hash of a string value X. |
mvcount(x) | Returns the number of values of X |
now() | Returns the current time, represented in Unix time |
null() | Returns null |
random() | Returns a random number from 0 to 2147483647 |
replace(x,y,z) | Returns a string formed by substituting string Z for every occurrence of regex Y in string X |
round(x,y) | Returns X rounded to the number of decimal places specified by Y. The default is to round to an integer |
split(x,"y") | Returns X as a multivalue field, split by delimiter Y |
time() | Returns the wall-clock time with microsecond resolution |
sqrt(x) | Returns the square root of X |
tonumber(x,y) | Converts input string X to a number, where Y (optional, defaults to 10) defines the base of the number in X |
tostring(x,y) | Returns the value of X as a string. If X is a number, it is reformatted as a string. If X is a Boolean value, it is reformatted as “True” or “False”. If X is a number, the optional second argument Y can be “hex”, “commas”, or “duration” |
typeof(x) | Returns a string representation of the field type |
urldecode(x) | Returns the URL X decoded |
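Several of the eval functions above are exercised in the sketch below. The field names and the literal values are assumptions chosen so the results are easy to check by hand.

```spl
| makeresults
| eval price=1234.567
| eval rounded=round(price, 2)
| eval size=case(price < 100, "small", price < 10000, "medium", true(), "large")
| eval pretty=tostring(round(price), "commas")
| eval slashed=replace("2024-01-15", "-", "/")
| table price, rounded, size, pretty, slashed
```

Under these assumptions rounded should come out as 1234.57, size as medium, pretty as 1,235, and slashed as 2024/01/15. The true() condition at the end of case acts as a catch-all branch.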
Common Stats Functions
Stats Function | Description |
---|---|
avg(x) | Returns the average of the values in X |
count(x) | Returns the number of occurrences of the field X |
dc(x) | Returns the count of distinct values in X |
earliest(x) | Returns the chronologically earliest seen value of X |
latest(x) | Returns the chronologically latest seen value of X |
max(x) | Returns the max value within field X. If the values of X are non-numeric, the max is found from lexicographical ordering |
min(x) | Returns the min value within field X. If the values of X are non-numeric, the min is found from lexicographical ordering |
median(x) | Returns the middle-most value of field X |
mode(x) | Returns the most frequent value of field X |
perc<x>(y) | Returns the X-th percentile value of the field Y |
range(x) | Returns the difference between the max and min values of the field X |
stdev(x) | Returns the sample standard deviation of the field X |
sum(x) | Returns the sum of the values of the field X |
sumsq(x) | Returns the sum of the squares of the values of the field X |
values(x) | Returns the list of all distinct values of the field X as a multivalue entry. The order of the values is lexicographical |
var(x) | Returns the sample variance of the field X |
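The sketch below applies a handful of the stats functions above to ten generated results. The field names n and parity and the output names are illustrative.

```spl
| makeresults count=10
| streamstats count AS n
| eval parity=if(n % 2 == 0, "even", "odd")
| stats count, dc(parity) AS distinct_parities, values(parity) AS parities, median(n) AS mid, perc90(n) AS p90, range(n) AS spread
```

For the numbers 1 to 10 the count should be 10, the distinct count of parity 2, and the range 9. The parities column should hold both values as a single multivalue entry.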
Conclusion
As you can see, there is a lot that can go into searching for specific data within Splunk, and there are a lot of methods that you could learn to optimize your search. I hope that you come away from this blog with a basic understanding of Splunk commands, and where to start to orchestrate and run your own searches.
Author
SEAN MALLOY
Sean Malloy is working as an Automation Engineer at Crest Data. Sean has worked on multiple automation and 508 Compliance projects for Splunk. Before joining Crest, Sean worked as an intern twice at SAP and has led multiple projects as part of his internship for Machine Learning and web development. Sean holds a Bachelor’s degree from UC Davis.