
Adoption Guide: Basics of GoStash Parsing

  • January 13, 2026

Digital-Customer-Excellence
Staff

Author: Darren Davis

 

Back in the early days of Google SecOps, when defining how we would map various raw log fields to a unified dataset, we determined that a customized variation of the data processing tool Logstash would be the best course of action. Lovingly code-named “GoStash”, we sought to create a simplified yet equally powerful fork of the original language, which has since evolved into the modern-day normalization language.

 

What are the components of GoStash?

 

The GoStash process is broken down into three main categories: Extraction, Manipulation, and Assignment. We will start with Extraction and introduce the filters in order from what I consider to be easiest to hardest.

 

Data Extraction

 

JSON Extraction

 

This filter is designed specifically for logs already structured in JavaScript Object Notation (JSON). It automatically parses key-value pairs and even deeply nested objects with seamless efficiency.

 

  • Operational Mechanism: The process involves reading the message field (or any other specified source) and expanding the JSON structure into accessible tokens.
  • Handling Arrays: To access discrete items within a JSON array—such as a list of IP addresses—the crucial parameter array_function => "split_columns" must be used.
    • This technique generates index-based tokens.

 

Why This is the Easiest (in my opinion)

 

The log, being structured, is self-describing. The JSON filter typically stands out as the easiest because the data is already organized into discernible key-value pairs and nested objects. Conventionally, one simply points the filter toward the message field, and it automatically expands every key into a readily usable token.

 

The Sole Complexity: Navigating arrays (lists). Should the JSON structure incorporate lists, one must either implement the array_function => "split_columns" parameter or iterate over the array in order to access the individual elements.

 

KV (Key-Value Pair Extraction)

 

The Key-Value (KV) filter is designed for parsing semi-structured logs where the data is articulated in explicit pairs, such as key=value or key:value.

 

  • Operational Mechanism: The core process involves segmenting the log string based on designated delimiters to isolate key-value structures.
  • Configuration Parameters:
    • field_split: Specifies the character utilized to demarcate distinct key-value pairs (e.g., a space or a pipe |).
    • value_split: Identifies the character responsible for separating the key from its corresponding value (e.g., = or :).
    • trim_value: A function that removes non-essential characters, such as quotation marks, surrounding the extracted value.
    • whitespace: Allows for configuration to "strict" or "lenient," thereby dictating the parser's interpretation of spacing adjacent to the defined delimiters.

 

CSV (Delimited Column Extraction)

 

This filter is designed to parse logs articulated in comma-separated values (CSV) or any other delimiter-separated formats, treating the raw log data as a structured sequence of columns.

 

  • Operational Mechanism: The fundamental process involves segmenting the initial message field based on the specified separator, typically a comma.
    • NOTE: The separator does not have to be a comma; any consistent delimiter can be used.
  • Output Paradigm: The extracted data values are automatically assigned to generic internal variables denoted as column1, column2, column3, and so forth.

 

XML (Hierarchical Data Extraction)

 

This filter is specifically utilized for logs articulated in the Extensible Markup Language (XML) format. Its operation relies on the comprehensive XPath syntax to precisely locate specific data nodes within the log structure.

 

  • Operational Mechanism: The process involves defining the source field and systematically mapping distinct XPath locations to user-defined variable names.
    • Example: /Event/System/EventID => eventId.
  • Iterative Functionality: Similar to JSON parsing, the XML parser supports iteration, enabling the traversal of repeated XML tags, such as a list of hosts.
    • Syntax: for index, _ in xml(message, /Event/Path/To/List).
    • The parser dynamically generates an index (commencing at 1) which is subsequently instrumental in the dynamic extraction of data from each constituent child node.

 

Grok (Unstructured Data Parsing)

 

Grok serves as the primary and most robust method for the systematic parsing of unstructured data, most notably standard syslog messages. It functions by seamlessly integrating predefined patterns with regular expressions to precisely match lines of text and extract critical data values into usable tokens. This is why it is last, as it can be the most difficult to master.

 

  • Operational Mechanism: The methodology requires the explicit definition of a pattern meticulously crafted to match the log's inherent structure. Upon a successful match, the relevant data is extracted and mapped directly to a specified variable.
  • Syntax: Grok utilizes the standard format %{PATTERN_NAME:variable_name}.
    • Illustrative Instance: %{IP:hostip} executes the extraction of an IP address and subsequently stores it within a token named hostip.
  • Integration of Regular Expressions: The system permits the integration of custom regular expressions, which must be formatted as (?P<token_name>regex_pattern). It is important to note that within this specific syntax, the use of double backslashes (\\) is mandatory for all escape characters, replacing the conventional single backslash.
  • Overwrite Functionality: Grok is equipped with an overwrite parameter, which enables the utility to supersede the value of an existing field (such as the original message content) with a newly extracted data value.

 

Core Examples of Data Manipulation

 

JSON

 

The Syntax:

 

json {

source => "@var"

array_function => "split_columns" #optional
target => "@var2" ##this is not standard practice, but is possible/OPTIONAL

}

 

@var indicates the variable that you are attempting to extract the JSON object from. This will typically be “message”, as all raw logs in the SecOps instance are referenced in the data as message=@raw_log. You can also set @var to the token name of a nested JSON object.

 

array_function is an argument unique to GoStash. It allows you to break apart the arrays that are within the JSON object.

 

Here is an example of the JSON filter at work. Consider the raw log:

 

{"timestamp":"2025-12-08T18:53:40.456Z","log_level":"INFO","event_type":"authentication","action":"logout","result":"success","user":{"username":"SystemAdmin","user_id":1,"role":"admin"},"session":{"session_id":"xuvcrjjjUk","ip_address":"194.206.109.196","user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"},"app_details":{"application":"Spellcaster-game-app","version":"1.0.3","pid":12345,"server":"SpellCaster-server-main-01"},"performance":{"latency_ms":73}}

 

The JSON object is message, so the resulting code block will simply be

 

json {

source => "message"

array_function => "split_columns"

}

 

The resulting textproto is as follows (cleaned & highlighted for this guide):

 

{

"@collectionTimestamp": {

"nanos": 0,

"seconds": 1765852715

},

"@createTimestamp": {

"nanos": 0,

"seconds": 1765852715

},

"@timestamp": {

"nanos": 0,

"seconds": 1765852715

},

"@timezone": "",

"action": "logout",

"app_details": {

"application": "Spellcaster-game-app",

"pid": 12345,

"server": "SpellCaster-server-main-01",

"version": "1.0.3"

},

"event_type": "authentication",

"log_level": "INFO",

"message": "{\"timestamp\":\"2025-12-08T18:53:40.456Z\",\"log_level\":\"INFO\",\"event_type\":\"authentication\",\"action\":\"logout\",\"result\":\"success\",\"user\":{\"username\":\"SystemAdmin\",\"user_id\":1,\"role\":\"admin\"},\"session\":{\"session_id\":\"xuvcrjjjUk\",\"ip_address\":\"194.206.109.196\",\"user_agent\":\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36\"},\"app_details\":{\"application\":\"Spellcaster-game-app\",\"version\":\"1.0.3\",\"pid\":12345,\"server\":\"SpellCaster-server-main-01\"},\"performance\":{\"latency_ms\":73}}\n",

"reason": "",

"result": "success",

"session": {

"ip_address": "194.206.109.196",

"session_id": "xuvcrjjjUk",

"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

},

"timestamp": "2025-12-08T18:53:40.456Z",

"user": {

"role": "admin",

"user_id": 1,

"username": "SystemAdmin"

}

}

 

Here is an example of using the array_function argument:

 

The raw log example is similar, now with an array of ip addresses within the session_details nested json object:

 

{"timestamp":"2025-12-08T18:53:40.456Z","log_level":"INFO","event_type":"authentication","action":"logout","result":"success","user":{"username":"SystemAdmin","user_id":1,"role":"admin"},"session":{"session_id":"xuvcrjjjUk","ip_address":["194.206.109.196","203.0.113.45","198.51.100.12"],"user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"},"app_details":{"application":"Spellcaster-game-app","version":"1.0.3","pid":12345,"server":"SpellCaster-server-main-01"},"performance":{"latency_ms":73}}

 

After we apply the same function, we receive this textproto:

 

{

"@collectionTimestamp": {

"nanos": 0,

"seconds": 1765853103

},

"@createTimestamp": {

"nanos": 0,

"seconds": 1765853103

},

"@timestamp": {

"nanos": 0,

"seconds": 1765853103

},

"@timezone": "",

"action": "logout",

"app_details": {

"application": "Spellcaster-game-app",

"pid": 12345,

"server": "SpellCaster-server-main-01",

"version": "1.0.3"

},

"event_type": "authentication",

"log_level": "INFO",

"message": "{\"timestamp\":\"2025-12-08T18:53:40.456Z\",\"log_level\":\"INFO\",\"event_type\":\"authentication\",\"action\":\"logout\",\"result\":\"success\",\"user\":{\"username\":\"SystemAdmin\",\"user_id\":1,\"role\":\"admin\"},\"session\":{\"session_id\":\"xuvcrjjjUk\",\"ip_address\":[\"194.206.109.196\",\"203.0.113.45\",\"198.51.100.12\"],\"user_agent\":\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36\"},\"app_details\":{\"application\":\"Spellcaster-game-app\",\"version\":\"1.0.3\",\"pid\":12345,\"server\":\"SpellCaster-server-main-01\"},\"performance\":{\"latency_ms\":73}}",

"performance": {

"latency_ms": 73

},

"reason": "",

"result": "success",

"session": {

"ip_address": {

"0": "194.206.109.196",

"1": "203.0.113.45",

"2": "198.51.100.12"

},

"session_id": "xuvcrjjjUk",

"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

},

"timestamp": "2025-12-08T18:53:40.456Z",

"user": {

"role": "admin",

"user_id": 1,

"username": "SystemAdmin"

}

}

 

 

Notice that we now see the ip_address array broken apart into the json token pair of array_index:value. This provides much more control over data transformation.

 

It is possible to perform additional JSON extractions on nested JSON objects. This can get tricky, as the key tokens within the nested JSON can often be the same as top-level key tokens. This can become destructive: the JSON extraction will not create unique key tokens and could overwrite key tokens from the top level. Keep this in mind when you perform additional JSON extractions.
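
As a minimal hedged sketch of such a second pass (assuming the session object from the earlier example has already been extracted, and relying on the note above that @var can be the token name of a nested JSON object):

json {
  # Second extraction against the nested "session" object instead of message
  source => "session"
}
# Keys such as session_id or ip_address are expanded again and could overwrite
# any top-level token that happens to share the same name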

 

KV

 

The Syntax:

 

kv {

source => "@var"

field_split => " " (defaults to space)

value_split => "=" (defaults to =)

trim_value => "\"" optional

whitespace => "strict" optional

}

 

@var indicates the variable that you are attempting to extract the key-value pairs from. This will typically be "message", but can be any field containing delimited data.

 

field_split allows you to specify the delimiter that separates each key-value pair, such as a space or a pipe character (|).

 

value_split identifies the delimiter between the key and the value, such as an equal sign (=) or a colon (:).

 

trim_value removes extraneous leading and trailing characters from the value, such as quotation marks.

 

Here is an example of the KV filter at work. Consider the raw log:

 

timestamp=2025-12-08T18:53:40Z action=allow src_ip=10.0.0.1 dst_ip=8.8.8.8 protocol=TCP user=jdoe

 

The KV string is message, so the resulting code block will simply be:

 

kv {

source => "message"

field_split => " "

value_split => "="

}

 

The resulting textproto is as follows (cleaned & highlighted for this guide):

 

{

"action": "allow",

"dst_ip": "8.8.8.8",

"message": "timestamp=2025-12-08T18:53:40Z action=allow src_ip=10.0.0.1 dst_ip=8.8.8.8 protocol=TCP user=jdoe",

"protocol": "TCP",

"src_ip": "10.0.0.1",

"timestamp": "2025-12-08T18:53:40Z",

"user": "jdoe"

}

 

Here is an example of using the trim_value and whitespace arguments:

 

The raw log example is similar, but now formatted with pipe delimiters and quoted values, which is common in many firewall logs:

 

id:101| status:"failed"| error_msg:"connection refused"| user: "admin"

 

If we run a standard KV extraction, the values would retain the quotation marks (e.g., "failed" instead of failed). By applying trim_value and adjusting the split characters, we can clean the data during extraction.

 

kv {

source => "message"

field_split => "|"

value_split => ":"

trim_value => "\""

whitespace => "lenient"

}

 

After we apply the same function, we receive this textproto:

 

{

"error_msg": "connection refused",

"id": "101",

"message": "id:101| status:\"failed\"| error_msg:\"connection refused\"| user: \"admin\"",

"status": "failed",

"user": "admin"

}

 

Notice that we now see the values extracted without the surrounding quotation marks and the parser ignored the extra space before "admin" due to the lenient whitespace setting. This provides much clearer data for subsequent UDM mapping.

 

It is possible to use KV to break up key/value pairs found inside other extraction methods, such as the output of a Grok statement. This is useful when a specific part of an unstructured log suddenly shifts into a structured key-value format.
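
As a hedged sketch of that approach (the log shape and the kv_body token are hypothetical), Grok first isolates the structured portion, and KV then breaks it apart:

grok {
  match => {
    # Capture everything after the syslog-style prefix into a kv_body token
    "message" => "%{SYSLOGTIMESTAMP:when} %{DATA:deviceName}: (?P<kv_body>.*)"
  }
  on_error => "grok_failed"
}
kv {
  # Run KV against the token produced by Grok instead of the raw message
  source => "kv_body"
  field_split => " "
  value_split => "="
}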

 

CSV

 

The Syntax:

 

csv {

source => "@var"

separator => "," optional (defaults to comma)

on_error => "csv_failed" optional

}

 

@var indicates the variable that you are attempting to extract the delimited data from. This will typically be "message".

 

separator identifies the character used to separate the values in the log line. While typically a comma, this can be configured to other characters such as a pipe (|) or tab.

 

on_error sets a boolean flag if the extraction fails, allowing you to handle malformed logs gracefully.

 

Here is an example of the CSV filter at work. Consider the raw log:

 

2025-12-08T18:53:40Z,allow,10.0.0.1,8.8.8.8,TCP,jdoe

 

The CSV string is message, so the resulting code block will simply be:

 

csv {

source => "message"

separator => ","

}

 

The resulting textproto is as follows (cleaned & highlighted for this guide). Note that because CSV logs do not contain headers, the parser automatically assigns the generic variable names column1, column2, etc. based on the position of the value.

 

{

"column1": "2025-12-08T18:53:40Z",

"column2": "allow",

"column3": "10.0.0.1",

"column4": "8.8.8.8",

"column5": "TCP",

"column6": "jdoe",

"message": "2025-12-08T18:53:40Z,allow,10.0.0.1,8.8.8.8,TCP,jdoe"

}

 

Here is an example of the mutation step required for CSV:


Because the extractor outputs generic column names, you must manually map these to meaningful tokens using the mutate/replace function immediately after extraction.

 

csv {

source => "message"

separator => ","

}

mutate {

replace => {

"timestamp" => "%{column1}"

"action" => "%{column2}"

"src_ip" => "%{column3}"

"dst_ip" => "%{column4}"

"protocol" => "%{column5}"

"user" => "%{column6}"

}

}

 

After we apply the mutation function, we receive this textproto:

 

{

"action": "allow",

"column1": "2025-12-08T18:53:40Z",

"column2": "allow",

"column3": "10.0.0.1",

"column4": "8.8.8.8",

"column5": "TCP",

"column6": "jdoe",

"dst_ip": "8.8.8.8",

"message": "2025-12-08T18:53:40Z,allow,10.0.0.1,8.8.8.8,TCP,jdoe",

"protocol": "TCP",

"src_ip": "10.0.0.1",

"timestamp": "2025-12-08T18:53:40Z",

"user": "jdoe"

}

 

Notice that we now have usable tokens (like src_ip) derived from the generic columns. It is critical to know the exact order of fields in the raw log, as a shift in position in the source data will result in incorrect data mapping.


To show another avenue, you can also use the rename mutation:

 

csv {

source => "message"

separator => ","

}

mutate {

rename => {

"column1" => "timestamp"

"column2" => "action"

"column3" => "src_ip"

"column4" => "dst_ip"

"column5" => "protocol"

"column6" => "user"

}

}

 

We will cover the differences, and all of the Data Assignments later in this guide.

 

XML (Hierarchical Data Extraction)

 

The Syntax:

 

xml {

source => "@var"

xpath => {

"/Path/To/Node" => "token_name"

}

}

 

@var designates the variable from which you intend to extract the XML data. This is conventionally the "message" field.

 

xpath serves as a hash map where the key provides the specific XPath to a node within the XML structure, and the value is the designated name of the token to which the data will be assigned.

 

Here is an example of the XML filter in operation. Consider the following raw log:

 

<Event><System><EventID>4624</EventID><Computer>win-server-01</Computer><TimeCreated SystemTime="2025-12-08T18:53:40Z"/></System></Event>

 

Since the XML string is contained within the message field, the resulting filter block is structured as follows:

 

xml {

source => "message"

xpath => {

"/Event/System/EventID" => "event_id"

"/Event/System/Computer" => "hostname"

"/Event/System/TimeCreated/@SystemTime" => "event_time"

}

}

 

The resulting textproto output is:

 

{

"event_id": "4624",

"event_time": "2025-12-08T18:53:40Z",

"hostname": "win-server-01",

"message": "<Event><System><EventID>4624</EventID><Computer>win-server-01</Computer><TimeCreated SystemTime=\"2025-12-08T18:53:40Z\"/></System></Event>"

}

 

XML logs benefit greatly from iteration, which we will discuss later.

 

Critical Index Note: It must be noted that unlike JSON arrays, which are zero-indexed (start at 0), the XML XPath index in this parser commences with 1.

 

Grok

 

The Syntax:

 

grok {

match => {

"@var" => "%{PATTERN_NAME:token_name}"

}

overwrite => ["token_to_overwrite"] optional, unless you initialize the variables in earlier grok functions, or in a mutation.

on_error => "grok_failure_tag" optional

}

 

@var designates the variable (typically "message") containing the unstructured text intended for meticulous parsing.

 

match represents the instruction where the parsing pattern is explicitly defined. This pattern can employ either Predefined Patterns (using the syntax: %{PATTERN:token}) or Custom Regular Expressions (using the syntax: (?P<token>regex_pattern)).

 

overwrite is a potent, optional argument enabling the replacement of an existing field's value with a new value extracted by the Grok pattern. This feature is routinely utilized to effectively remove extraneous headers from a log message by extracting the core log "body" and subsequently superseding the original "message" token with this purified content.

 

on_error institutes an optional boolean flag to signal failure if the pattern fails to match the log line precisely. Given Grok’s inherently character-sensitive nature, this mechanism is paramount for robust error handling.

 

Critical Detail on Regular Expressions: It must be emphatically noted that, divergent from standard PCRE regex, the parser syntax within GoStash necessitates the use of double backslashes (\\) for all escape characters, in lieu of the conventional single backslash. For instance, to match a literal dot (.) or a digit (\d), one must explicitly write \\. and \\d, respectively.
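
As a small hedged illustration of this rule (the log content and the app_version token are hypothetical), matching a dotted version number such as 1.0.3 requires escaping both the digit shorthand and the literal dots with double backslashes:

grok {
  match => {
    # \\d matches a digit and \\. matches a literal dot, per the double-backslash rule
    "message" => "version (?P<app_version>\\d+\\.\\d+\\.\\d+)"
  }
  on_error => "version_not_found"
}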


Here is an illustrative instance of the Grok filter employing Predefined Patterns. Consider the following raw log:

 

Mar 15 11:08:06 hostdevice1: FW-112233: Accepted connection TCP 10.100.123.45:9988 to 8.8.8.8:53

 

Since the string resides in the message field, the resulting code block leveraging standard patterns is:

 

grok {

match => {

"message" => "%{SYSLOGTIMESTAMP:when} %{DATA:deviceName}: FW-%{INT:messageid}: (?P<action>Accepted|Denied) connection %{WORD:protocol} %{IP:srcAddr}:%{INT:srcPort} to %{IP:dstAddr}:%{INT:dstPort}"

}

}

 

The resultant textproto, presented for clarity and comprehension, is as follows:

 

{

"action": "Accepted",

"deviceName": "hostdevice1",

"dstAddr": "8.8.8.8",

"dstPort": "53",

"message": "Mar 15 11:08:06 hostdevice1: FW-112233: Accepted connection TCP 10.100.123.45/9988 to 8.8.8.8/53",

"messageid": "112233",

"protocol": "TCP",

"srcAddr": "10.100.123.45",

"srcPort": "9988",

"when": "Mar 15 11:08:06"

}

 

The official repo for grok patterns is here.

 

Next is a detailed example demonstrating both Custom Regex and the Overwrite function in tandem:


Frequently, logs incorporate a syslog header that impedes the effectiveness of subsequent parsers (such as JSON or KV). The Grok filter is used to "pare down" this header. As noted above, there is additional consideration when using special characters and escaping in GoStash. See the chart below for some examples, and see here for the official Syntax of RE2

 

Consider this raw log, where a syslog header precedes a structured JSON message:

 

<134>Oct 06 10:00:00 my-host {"user":"bob","action":"login"}

 

Should we attempt to apply the json filter immediately, it will inevitably fail due to the preceding timestamp and hostname information. It is therefore mandatory to use Grok to extract only the JSON payload and overwrite the contents of the message field.

 

grok {

# Note the double backslashes: \\s for space and \\S for non-whitespace

match => {

"message" => "^<%{INT}>%{SYSLOGTIMESTAMP}\\s%{HOSTNAME}\\s(?P<json_body>\\{.\\})"

}

# We overwrite 'message' with the 'json_body' token we just extracted

overwrite => ["message"]

}



# Now 'message' is pure JSON, so the JSON filter will succeed

json {

source => "message"

}

 

Following the application of this sequence, the message token is irrevocably altered for the duration of the pipeline. The final textproto structure is as follows:

 

{

"action": "login",

"json_body": "{\"user\":\"bob\",\"action\":\"login\"}",

"message": "{\"user\":\"bob\",\"action\":\"login\"}",

"user": "bob"

}

 

Observe how the message field no longer contains the disruptive syslog header (<134>Oct 06...), thereby ensuring the subsequent JSON parser executes without failure. This sequential architecture of Grok (for preliminary cleanup) + Overwrite + Structured Parser (JSON/KV) constitutes a foundational and essential pattern within the GoStash methodology. Hang on to those fields in the syslog header, as they can be beneficial in cases where the log generator is observing the event.

 

Base64 Decode

 

Base64 decode is an interesting one, as it unlocks encoded logs so that you can use any other extractor. It can also be used on individual fields.

 

The Syntax:

 

base64 {

source => "@var"

target => "target_token"

encoding => "RawStandard" optional (defaults to Standard)

}

 

@var designates the variable containing the Base64 encoded string intended for immediate decoding. This can be an individual raw log field, or the entire raw log itself.

 

target specifies the destination variable where the resultant decoded plaintext string will be stored.

 

encoding is an optional parameter utilized to explicitly define the encoding scheme. It defaults to "Standard" but accommodates configurations such as "URL" or "RawStandard".

 

Here is an illustration of the Base64 filter in operation. Consider a scenario where a specific IP address has been encoded within the log:

 

ip_address="MjYwMzo4MDAwOjc2MDA6YzRlMTo0ZGI6NDAwYjpmZjI6NjYyNg=="

 

Assuming prior extraction has successfully isolated the value into the token ip_address, the requisite code block for decoding is as follows:

 

base64 {

source => "ip_address"

target => "ip_decoded"

}

 

The resulting textproto output (cleaned and highlighted for this guide) is as follows:

 

{

"ip_address": "MjYwMzo4MDAwOjc2MDA6YzRlMTo0ZGI6NDAwYjpmZjI6NjYyNg==",

"ip_decoded": "2603:8000:7600:c4e1:4db:400b:ff2:6626"

}

 

Observe that the ip_decoded token now correctly contains the plaintext IPv6 address, rendering it immediately suitable for subsequent mapping to the UDM target IP field.
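
Because the filter can also target the entire raw log (as noted above), here is a minimal hedged sketch, assuming a hypothetical source that ships its whole payload as Base64-encoded JSON; the decoded_log token name is illustrative:

base64 {
  # Decode the entire raw log into a working token
  source => "message"
  target => "decoded_log"
}
json {
  # Any other extractor can now run against the decoded plaintext
  source => "decoded_log"
}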

 

Data Transformation

 

Once we extract the data, we can begin to prepare it for assignment. You can prepare the data by using the following functions:

 

  • Convert
  • Gsub
  • Split
  • Date
  • Lowercase/Uppercase

 

Convert

 

Convert is a powerful way to make sure that the values in the raw log are transformed into the proper data type. Here are the data types that can be used in the convert function:

 

Based on the parsing syntax reference and the Unified Data Model (UDM) requirements, here are the definitions for the specific data types.

 

It is important to note that while raw logs are ingested primarily as text, the UDM schema enforces strict typing. You must use the Convert filter to transform data into these formats before assignment.

 

String

 

  • Definition: A sequence of characters representing text.
  • Context: This is the default state of almost all data extracted from a raw log (via Grok, KV, CSV, etc.) before any mutation occurs.

 

Integer

 

  • Definition: A whole number (positive, negative, or zero).
  • Context: Used for standard numerical fields in the schema.

 

uinteger (unsigned integer)

 

  • Definition: A whole number that must be non-negative (0 or greater).
  • Context: This is frequently used for metrics that cannot mathematically be negative, such as byte counts or packet counts.

 

Float

 

  • Definition: A number that contains a decimal point.
  • Context: Used for precise measurements or coordinates.

 

Boolean

 

  • Definition: A binary value representing true or false.
  • Context: Used for flags and status indicators.

 

ipaddress

 

  • Definition: A specialized format for IPv4 and IPv6 addresses.
  • Context: Unlike a standard string, converting to this type validates the structure of the data. If the value is not a valid IP, the conversion will fail.
  • Additional Context: IP address fields need to be converted before assigning the value to the target schema field. It is also used as a logic check; you can attempt to convert a host field to ipaddress and use on_error to determine if the value is an IP or a Hostname.

 

The Syntax:

 

mutate {

convert => {

"token_name" => "target_data_type"

}

on_error => "error_tag" optional

}

 

token_name designates the variable currently holding the value intended for systematic transformation.

 

target_data_type specifies the precise format into which the data is being converted.

 

on_error institutes an optional boolean flag to signal conversion failure (e.g., when attempting to coerce the textual string "N/A" into a numerical integer). This mechanism is paramount for robust error handling and maintaining data validity.

 

Necessity of Conversion: Raw log files typically contain data exclusively as simple strings. Conversely, the Google SecOps Unified Data Model (UDM) rigorously mandates strict data typing. Consequently, a token representing a port number must be explicitly converted to an integer, and an IP address must be converted to a valid ipaddress format. Neglecting this conversion step risks parser failure or the silent dropping of the value during the mapping process.


Here is an illustration of the Convert function in operation. Consider the following raw log:

 

{"process_id": "1024", "duration_ms": "50.5", "is_admin": "true"}

 

Following the JSON extraction phase, these values are initially treated as strings due to the surrounding quotation marks. They must be appropriately converted prior to their assignment to numerical fields within the UDM schema. Keep in mind that attempting to convert a field that is already the target data type will result in an error.

 

json {

source => "message"

}

mutate {

convert => {

"process_id" => "uinteger"

"duration_ms" => "float"

"is_admin" => "boolean"

}

}

 

The resultant textproto, cleaned and highlighted for clear comprehension, is as follows:

 

{

"duration_ms": 50.5,

"is_admin": true,

"message": "{\"process_id\": \"1024\", \"duration_ms\": \"50.5\", \"is_admin\": \"true\"}",

"process_id": 1024

}

 

Observe that within the output, the values corresponding to process_id, duration_ms, and is_admin are no longer encapsulated by quotation marks, unequivocally confirming their successful transformation into the stipulated numerical and boolean data types.

 

Utilizing Convert for Validation:


The Convert function can also be used as a validation tool, particularly when dealing with IP addresses. Any attempt to convert a malformed string to an ipaddress will immediately trigger the on_error flag.

 

mutate {

convert => {

"source_ip" => "ipaddress"

}

on_error => "not_a_valid_ip"

}

 

Subsequently, conditional logic can be executed using the error flag:

 

if [not_a_valid_ip] {

# Handle the error, for example by assigning a placeholder value to the IP field

mutate {

merge => { "event.idm.read_only_udm.principal.ip" => "0.0.0.0" }

}

}

 

Gsub

 

The Syntax:

 

mutate {

gsub => [

"field_name", "regex_pattern", "replacement_string",

"field_name_2", "regex_pattern_2", "replacement_string_2" more than one field substitution is optional

]

}

 

gsub is the acronym for Global Substitution. Its function is to systematically match a regular expression against a designated field's value and subsequently replace all discovered instances with the specified replacement string. This modification is restricted exclusively to string data types.

 

The Configuration Array is structured to accept sets of three elements for each required substitution:

 

  1. field_name: The literal name of the token whose value requires modification.
  2. regex_pattern: The regular expression pattern employed for matching (which adheres to RE2 syntax).
  3. replacement_string: The precise text or value designated to be inserted in place of the matched pattern.


Here is an illustrative example of gsub in action. Consider a scenario where a file path employs forward slashes, but the established UDM convention dictates the use of underscores:

 

{"file_path": "/var/log/syslog", "service": "cron"}

 

Assuming the prerequisite JSON extraction has already been completed, the resulting filter block is configured as follows:

 

mutate {

gsub => [

# Replace all forward slashes with underscores

"file_path", "/", "_"

]

}

 

The resultant textproto, presented for immediate clarity and comprehension, yields:

 

{

"file_path": "_var_log_syslog",

"message": "{\"file_path\": \"/var/log/syslog\", \"service\": \"cron\"}",

"service": "cron"

}

 

Escaping Special Characters: The Quadruple Backslash Rule

 

A critical and imperative exception exists when utilizing gsub for the purpose of matching and escaping a literal backslash (\). To successfully match a single backslash character within the raw text, you must explicitly employ four backslashes (\\\\) in the pattern definition.

 

  • Rationale for the Multi-Level Escaping: This is necessitated by two distinct levels of parser interpretation:
  1. The double-quoted string parser initially interprets the literal \\\\ sequence and simplifies it to \\.
  2. Subsequently, the underlying Regular Expression engine receives \\ and interprets this as an escaped literal backslash (\) to be matched against the text.


Example of Escaping: Consider a scenario where a domain user is formatted as DOMAIN\User. Our objective is to replace the backslash with a period (.):

 

mutate {

gsub => [

# Pattern matches a literal backslash

"username", "\\\\", "."

]

}

 

Input: The value DOMAIN\User.

Output: The transformed value DOMAIN.User

 

Split

 

The Syntax:

 

mutate {

split => {

source => "source_field"

separator => "delimiter"

target => "target_field"

}

}

 

source indicates the existing variable (token) containing the string you wish to break apart.

 

separator identifies the character that delimits the items within the string, such as a comma (,), pipe (|), or semicolon (;).

 

target identifies the variable where the resulting array will be stored. If you want to overwrite the original string with the new array, you can set the target to be the same name as the source.

 

Why this is necessary: The Split function is used to transform a single string into an iterable array. This is critical when mapping data to UDM fields that are defined as "repeated" (arrays), such as principal.ip (which can hold multiple IPs) or security_result.category_details. Without splitting, the parser would treat a list like "admin,hr,finance" as a single long string rather than three distinct values.


Here is an example of the split output in operation. Consider a raw log where user groups are listed in a single field separated by commas:

 

user:jdoe groups:admin,wifi_users,vpn_access

 

First, we extract the data using KV. At this stage, the token groups is just one long string: "admin,wifi_users,vpn_access".

 

kv {

source => "message"

field_split => " "

value_split => ":"

}

mutate {

split => {

source => "groups"

separator => ","

target => "groups_array"

}

}

 

The resulting textproto output is as follows (cleaned & highlighted for this guide):

 

{

"groups": "admin,wifi_users,vpn_access",
"groups_array": {
"0": "admin",
"1": "wifi_users",
"2": "vpn_access"
},

"message": "user:jdoe groups:admin,wifi_users,vpn_access",

"user": "jdoe"

}

 

Notice that groups_array is now a list. You can now use this array to map multiple values to a single UDM field, or iterate through it using a for loop to perform logic on each individual group.

 

Date

 

The Syntax:

 

date {

match => ["token_name", "format_pattern", "optional_fallback_pattern"] optional_fallback_pattern is kwargs, meaning you can have a multitude of patterns.

timezone => "America/New_York" optional

rebase => true optional

target => "target_field" optional

on_error => "error_tag" optional

}

 

match is the array stipulating the extracted token (the raw timestamp string) coupled with the requisite format pattern(s) guiding the parser on precise interpretation of the time value.

 

timezone applies a specified time zone offset if the source log intrinsically lacks one. Should this be omitted, and the log provides no offset, the system may default the time to UTC. Look here for the list of timezones that can be referenced.

 

rebase is a crucial boolean parameter employed for logs that contain a date without an explicit year (a characteristic often seen in Syslog, e.g., "Mar 15 11:08:06"). Activating this sets the year based on the ingestion time, thereby preempting indexing anomalies.
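
As a hedged sketch of rebase in use (assuming a hypothetical log_time token holding a year-less syslog value such as "Mar 15 11:08:06", and assuming MMM dd HH:mm:ss is an accepted custom format string):

date {
  # "Mar 15 11:08:06" carries no year, so rebase derives it from the ingestion time
  match => ["log_time", "MMM dd HH:mm:ss"]
  rebase => true
  timezone => "UTC"
}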

 

target permits the normalized timestamp to be saved into an alternative, designated field (such as metadata.collected_timestamp). If this parameter is not specified, the filter automatically populates the default @timestamp field, which conventionally maps to metadata.event_timestamp.

 

Why this is necessary:

 

The UDM fields designated for timestamps stringently require a properly normalized date value. A simple string such as "2023-12-01" is insufficient; the data must first undergo successful processing by this specific function before it can be mapped to a UDM timestamp field.


Here is an illustrative example of the Date filter in active use. Consider a raw log containing an extracted timestamp:

 

time="2025-12-08 18:53:40" user="jdoe"

 

Assuming the prerequisite extraction has placed the timestamp into the token log_time, the resulting filter block is configured as follows:

 

date {

match => ["log_time", "yyyy-MM-dd HH:mm:ss"]

timezone => "UTC"

}

 

The resulting textproto is:

 

{

"@timestamp": {

"nanos": 0,

"seconds": 1765306420

},

"log_time": "2025-12-08 18:53:40",

"message": "time=\"2025-12-08 18:53:40\" user=\"jdoe\"",

"user": "jdoe"

}

 

Observe that the parser successfully generated the @timestamp field (expressed in seconds and nanoseconds) directly from the raw string input.

 

Supported Named Date Patterns:

 

Google SecOps formally supports the following essential predefined named date formats:

 

  • ISO 8601
  • RFC 3339
  • UNIX
  • UNIX_MS
  • TIMESTAMP_ISO8601

 

Using Custom Patterns:

 

Should your log not conform to any of the established named patterns, you possess the ability to define a custom format string. This is achieved through the utilization of standard date and time characters (e.g., yyyy for year, MM for month, and HH for hour) to meticulously match the exact structure of the log data. Check this Google SecOps Documentation for more information
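
For contrast, here is a minimal hedged sketch using one of the named patterns listed above, assuming a hypothetical epoch_time token that holds a UNIX epoch value in seconds:

date {
  # UNIX is one of the supported named formats (epoch seconds)
  match => ["epoch_time", "UNIX"]
}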

 

Lowercase

 

The lowercase mutation function serves to convert all characters within a designated token to their lower-case equivalent.

 

The Syntax:

 

mutate {

lowercase => [ "token_name" ]

}

 

token_name indicates the variable (or list of variables) containing the string values you wish to convert to lowercase.

 

Why this is necessary: Different vendors may send the same value in different formats (e.g., "TCP", "tcp", or "Tcp"). Converting these fields to a standard lowercase format ensures that search queries, detection rules, and aggregations function consistently regardless of the input format. For case-sensitive fields where the original casing is significant, this is not recommended.

 

Here is an example of the lowercase function at work. Consider a raw log where the protocol and username are in uppercase or mixed case:

 

protocol="TCP" username="AdminUser"

 

We assume extraction has already occurred via KV or Grok. The resulting code block will be:

 

mutate {

lowercase => [ "protocol", "username" ]

}

 

The resulting textproto is as follows (cleaned & highlighted for this guide):

 

{

"message": "protocol=\"TCP\" username=\"AdminUser\"",

"protocol": "tcp",

"username": "adminuser"

}

 

Notice that the values for protocol and username have been transformed to "tcp" and "adminuser" respectively. This ensures that a search for user = "adminuser" will succeed even if the raw log contains "AdminUser".

 

Additional Note: uppercase functions the same way.

 

Data Assignment

 

Now, we can get on with simple assignments. The functions that allow for assigning into a UDM field are the following:

 

  • Replace
  • Rename
  • Copy
  • Merge

 

Replace

 

The Replace function is the principal mechanism for assigning string values to a given token. It is versatile, permitting the assignment of constants, utilizing existing field values, or synthesizing a combination of both. Replace restricts the value being assigned to be a string. If it is anything but a string, you will receive an error message.

 

The Syntax:

 

mutate {

replace => {

"target_field" => "value_to_assign"

}

}

 

1. Instantiation of Fields (Variable Declaration)

 

A fundamental best practice within complex parsers involves the “instantiation” of designated tokens at the beginning of your code. This step ensures that the variables are instantiated (typically as empty strings) before any subsequent logic attempts to reference them, thus preventing runtime errors if a specific log entry happens to omit that particular data point.

 

This technique proves particularly helpful if you plan to use the merge function later in the pipeline, or if it is necessary to guarantee a field is non-null before it undergoes evaluation in a conditional statement.


Example: In this instance, we initialize a temporary token designated as destination to an empty string prior to leveraging the KV filter for its subsequent population.

 

filter {

# Initialize the token to ensure it exists

mutate {

replace => {

"destination" => ""

}

}



# Extract data (which might or might not contain the destination)

kv {

source => "message"

field_split => " "

trim_value => "\""

on_error => "kvfail"

}



# Later assignment

mutate {

replace => {

"event.idm.read_only_udm.target.hostname" => "%{destination}"

}

}

}

 

2. Assigning Constants (Hardcoding)

You will routinely utilize replace to hardcode specific UDM fields that must remain invariant for every log generated by the current parser, such as the vendor designation or the overarching event type.


Example: Irrespective of the log content, we stipulate that this event originates from "CrowdStrike" and the product is definitively "Falcon".

 

mutate {

replace => {

"event.idm.read_only_udm.metadata.vendor_name" => "CrowdStrike"

"event.idm.read_only_udm.metadata.product_name" => "Falcon"

"event.idm.read_only_udm.metadata.event_type" => "EDR_PROCESS"

}

}

 

3. Assigning Variables (Dynamic Mapping)

To accurately map the data previously extracted (e.g., a username discovered via Grok) to its appropriate UDM field, you employ the %{token_name} syntax. This command instructs the parser to retrieve the value contained within that specified variable and assign it to the target field.


Example: We are retrieving the value encapsulated within the src_user token and assigning it to the UDM field: principal.user.userid.

 

mutate {

replace => {

"event.idm.read_only_udm.principal.user.userid" => "%{src_user}"

}

}

 

4. Combination (String Interpolation)

It is feasible to synthesize both constants and variables within a single assignment operation. This technique is invaluable for the structured construction of URLs, detailed descriptions, or unique identifiers.


Example: Structuring a comprehensive description string that seamlessly incorporates dynamic data elements.

 

mutate {

replace => {

Result: "User jdoe attempted login from 10.0.0.1"

"event.idm.read_only_udm.metadata.description" => "User %{username} attempted %{action} from %{src_ip}"

}

}

 

Rename

 

The Syntax:

 

mutate {

rename => {

"original_token" => "new_token"

}

}

 

original_token is the current designation of the variable (e.g., extracted via Grok, KV, or JSON).

 

new_token is the revised name designated for the variable moving forward. This is typically a UDM field name (e.g., event.idm.read_only_udm.network.ip_protocol), though it can also be a temporary token selected for enhanced code clarity.

 

Why this is necessary: The rename function performs a "destructive" assignment; rename moves the value to the new token and systematically removes the original. This methodology promotes high efficiency when mapping extracted data directly to the UDM schema, as it maintains a clean internal state by eliminating redundant intermediate variables.


Here is an example of the rename output in operation. Consider a log scenario where we have successfully extracted a protocol and a source port:

 

proto="tcp" srcport="443"

 

We assume these values currently reside in the tokens proto and srcport. The resulting filter block is configured as follows:

 

mutate {

rename => {

"proto" => "event.idm.read_only_udm.network.ip_protocol"

"srcport" => "event.idm.read_only_udm.network.target.port"

}

}

 

The resultant textproto output is:

 

{

"event": {

"idm": {

"read_only_udm": {

"network": {

"ip_protocol": "tcp",

"target": {

"port": 443

}

}

}

}

},

"message": "proto=\"tcp\" srcport=\"443\""

}

 

Observe that the fields proto and srcport no longer persist in the root output; they have been entirely migrated into the UDM structural hierarchy.

 

Critical Requirement: It is imperative that the original token and the new token are of identical data types before executing the rename transformation. For example, if a port number is initially extracted as a string (the default state) but the designated UDM field target.port mandates an integer, the Convert function must be applied prior to the Rename operation.

 

Correct Order of Operations:

 

mutate {

# 1. Convert the data type first

convert => {

"srcport" => "integer"

}

}

mutate {

# 2. Rename (move) the token to the UDM field

rename => {

"srcport" => "event.idm.read_only_udm.network.target.port"

}

}

 

Copy

 

The Syntax:

 

mutate {

copy => {

"destination_token" => "source_token"

}

}

 

source_token is the existing variable containing the data you wish to duplicate.

 

destination_token is the designated name of the new variable where the data will be placed. Note that if this token already exists, it will be superseded; if it does not, it will be instantiated.

 

Why this is necessary: The Copy function executes a "deep copy" of a token, which means the resulting new token operates entirely independently of the original source. This is very important when the need arises to:

 

  1. Map one value to multiple UDM fields: For instance, assigning a single IP address to both the principal.ip and asset.ip fields.
  2. Preserve original data: You might opt to retain the raw extracted value (e.g., mixed-case "AdminUser") for incorporation into a description field, while concurrently manipulating a dedicated copy (e.g., lowercased "adminuser") for the strictly normalized user ID field. Since subsequent modifications to the destination token exert no effect on the source token, the integrity of the original data is ensured.
  3. Append to the main parser from a parser extension: Rather than having to rewrite parts of the main parser again in your parser extension, you can use copy to append values to the main parser.


Here is an illustrative example of the copy output in operation. Consider a scenario where a username has been extracted, and it is necessary to utilize it for both the User ID and a description string.

 

user="JDOE"

 

We assume that the user token has already been successfully populated during the extraction phase. Our requirement is to transform one version to lowercase for the ID field, while maintaining the uppercase version for a display string.

 

mutate {

# Create an independent deep copy of the user token

copy => {

"user_clean" => "user"

}

}

mutate {

# Modify the newly created copy, thereby leaving the original 'user' token untouched

lowercase => [ "user_clean" ]

}

mutate {

replace => {

"event.idm.read_only_udm.principal.user.userid" => "%{user_clean}"

"event.idm.read_only_udm.security_result.description" => "User %{user} logged in."

}

}

 

The resultant textproto output is as follows:

 

{

"event": {

"idm": {

"read_only_udm": {

"principal": {

"user": {

"userid": "jdoe"

}

},

"security_result": {

"description": "User JDOE logged in."

}

}

}

},

"message": "user=\"JDOE\"",

"user": "JDOE",

"user_clean": "jdoe"

}

 

Observe that the user token was retained as "JDOE" (uppercase), whereas user_clean was successfully converted to lowercase as "jdoe".

 

Merge

 

The Syntax:

 

mutate {

merge => {

"target_field" => "source_field"

}

}

 

source_field is the variable containing the data you wish to add.

 

target_field is the variable (usually a UDM field or the final output) receiving the data.

 

Why this is necessary: The Merge function is utilized to join multiple fields together or append new values to an existing list. It is fundamentally distinct from replace because it performs a combination rather than an overwriting of existing data. This operation is essential for populating repeated fields (arrays) within the UDM schema and serves as the mandatory final step for delivering the event to the ingestion pipeline.

 

Key Behavior: Merge employs a "naive" implementation regarding existence checks. Should you attempt to merge a finite collection of values (e.g., ips.0, ips.1, ips.2), the function will successfully process the values that are present and silently omit any that do not exist, thereby preventing parser failure.
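
Before the fuller examples below, here is a minimal hedged sketch of that behavior, assuming hypothetical indexed tokens ips.0 and ips.1 produced by an earlier split_columns extraction:

mutate {
  merge => {
    "event.idm.read_only_udm.principal.ip" => "ips.0"
  }
}
mutate {
  merge => {
    # If ips.1 does not exist for a given log, this merge is silently skipped
    "event.idm.read_only_udm.principal.ip" => "ips.1"
  }
}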

 

Examples

 

1. Populating Repeated Fields

 

If a single log entry contains multiple IP addresses for the same entity (e.g., an IPv4 and an IPv6 address), you cannot use replace for the second one, as it would unilaterally overwrite the first. The use of merge is mandatory.

 

mutate {

merge => {

"event.idm.read_only_udm.target.ip" => "dst_ipv4"

}

}

mutate {

merge => {

"event.idm.read_only_udm.target.ip" => "dst_ipv6"

}

}

 

Result: The target.ip field in UDM will contain a list: ["192.168.1.1", "fe80::1"].

 

2. Mapping fields to higher-level Repeated categories

 

A highly common and powerful use of Merge involves "staging" data in a temporary variable before committing it to the UDM. This is particularly useful for repeated fields that contain nested objects, such as security_result or attribute.labels.

 

Because security_result is a list (an array) of objects, you often need to define multiple properties (Action, Severity, Description) for a single result, and potentially add multiple results for a single log.

 

The Strategy for Populating Repeated Nested Fields:

 

  1. Utilize Replace to construct a provisional "container" variable (e.g., temp_result) and systematically populate its sub-fields with extracted data.
  2. Employ Merge to append the entirety of that prepared container into the final UDM array structure.

 

The Syntax:

 

mutate {

# 1. Build the temporary object with all necessary sub-fields

replace => {

"temp_result.action" => "BLOCK"

"temp_result.severity" => "HIGH"

"temp_result.description" => "Malware detected by firewall"

}

}

mutate {

# 2. Merge the temporary object into the UDM repeated field

merge => {

"event.idm.read_only_udm.security_result" => "temp_result"

}

}

 

Why this is necessary: Should you attempt to use replace directly on event.idm.read_only_udm.security_result.action, you incorrectly treat a repeated field as a scalar (single) value. If the raw log documents two distinct security findings, the second replace operation would summarily overwrite the first result. Conversely, by staging the data in temp_result and then merging the complete object, you instruct the parser to "Add this discrete result object to the accumulating list of results".

 

Here is an example concerning labels. The established method for handling labels, which are defined as key-value pairs stored within a list, requires iterating through the raw data. This process involves creating a temporary label object that contains both a .key field and a .value field, and subsequently merging this temporary object into the final UDM structure.

 

# Constructing a complex object inside a loop
for Source in sources {

mutate {

replace => {

# Staging the data in a temporary variable 'resId_map_label'

"resId_map_label.key" => "ResourceId"

"resId_map_label.value" => "%{Source.resourceId}"

}

}

mutate {

# Merging the staged object into the final UDM array

merge => {

"event.idm.read_only_udm.principal.resource.attribute.labels" => "resId_map_label"

}

}

}

 

3. Generating Output (Mandatory Step)

 

The final and non-negotiable step of every parser is to merge the completed event object into the specialized @output field. This signals to Google SecOps that the parsing process is finished and the event record is fully prepared for ingestion.

 

mutate {

merge => {

"@output" => "event"

}

}

 

Distinction: Replace vs. Merge in UDM Fields

 

Understanding the precise context for employing replace versus merge is paramount, given that the Unified Data Model (UDM) rigorously enforces cardinality: whether a field permits a single value (scalar) or multiple values (repeated).

 

1. Fields that require REPLACE (Scalar/Single-Value String Fields)

 

These UDM fields are logically or mathematically restricted to hold only one value at any given time. Attempting to merge a second value into them would result in a schema violation or parser failure. Consequently, you must employ replace to set these values.

 

  • **Logic: **"Set this value."
  • Examples:
  • metadata.event_timestamp (An event occurs at a single, precise moment.)
  • metadata.event_type (An event belongs to one specific type, e.g., PROCESS_LAUNCH.)
  • security_result.action (A firewall either ALLOWed or BLOCKed the packet; it cannot perform both.)
  • target.port (A connection must target a specific integer port.)

 

2. Fields that require MERGE (Repeated/Array Fields)

 

These UDM fields are formally defined as "repeated" within the schema, designed specifically to accommodate a list of values. While it is technically possible to use replace to define the first value, best practice, coupled with sound parsing logic, strongly dictates the use of merge to ensure that the field can gracefully handle multiple values if they are supplied by the raw log source.

 

  • **Logic: **"Add to this list."
  • Examples:
  • principal.ip (A device can be associated with multiple IP addresses concurrently.)
  • target.user.email_addresses (A user is permitted to have multiple email aliases.)
  • principal.resource.attribute.labels (A resource can be assigned an unlimited sequence of descriptive labels or tags.)
  • intermediary.hostname (A network connection may traverse multiple intermediary proxy hops.)

 

Summary Table:

| Function | Action | Target Data Type | UDM Field Example |
| --- | --- | --- | --- |
| Replace | Overwrites | String (Scalar) | metadata.vendor_name |
| Merge | Appends/Joins | Array (Repeated) | principal.ip |

 

Final Pieces to GoStash

 

There are a few more components to GoStash needed to make you a parsing champion: Conditional Logic, Looping, and Drop.

 

Conditional Logic

 

The Syntax:

 

GoStash within the Google SecOps environment supports fundamental conditional flow constructs, notably the if, else if, and else statements. These mechanisms are paramount for precisely governing the execution sequence of parsing logic based upon the presence and valuation of tokens within the processed logs.

 

if [token_1] == "value" {
#Logic to execute if true

mutate { ... }

}

else if [token_1] == "other_value" or [token_2] == "value" {

# Logic to execute if the first condition failed, but this one is true

}
else if [token_1][subtoken1] == "value" {
# Logic to execute if the subtoken in a nested field matches
}

else {

# Fallback logic if no conditions are met

}

 

Supported Operators:

 

The following set of operators is available for evaluating and comparing tokens:

 

  • Equality: Utilizes == (Is equal to) and != (Is not equal to).
  • Comparison: Enables relative magnitude checks: < (Less than), > (Greater than), <= (Less than or equal), and >= (Greater than or equal).
  • Regular Expressions: Provides pattern matching capabilities: =~ (Matches regex) and !~ (Does not match regex).
  • Membership: The in operator efficiently ascertains whether a specified value is contained within a defined list or array.
  • Boolean Logic: Supports complex, multi-condition evaluation using and and or.

 

Why this is necessary: Conditional logic acts as the decision-making core of your parser. Its primary applications include:

 

  1. Normalization: The process of mapping disparate vendor terminology (e.g., "drop," "block," or "deny") to a singular, authoritative UDM standard value (e.g., "BLOCK").
  2. Safety Checks: Ensuring the integrity of the data pipeline by verifying that a token is present or not empty prior to executing sensitive operations like base64 decoding.
  3. Error Handling: Reacting strategically to error tags set by the on_error mechanism to isolate and manage logs that failed previous parsing stages.
  4. Filtering: Employing the unequivocal drop {} filter to systematically remove irrelevant or non-actionable log entries from the pipeline (a sketch appears as example 4 below).

Examples

 

1. Normalizing Vendor Actions (The if/else Chain)

 

 Different security vendors frequently utilize diverse terminology to describe identical actions. The following chain is fundamental for standardizing disparate "block" terms and "allow" terms into the mandated UDM actions, BLOCK and ALLOW, respectively.

 

if [action] == "drop" or [action] == "deny" or [action] == "block" {
  mutate {
    merge => {
      "event.idm.read_only_udm.security_result.action" => "BLOCK"
    }
  }
}
else if [action] == "allow" or [action] == "permit" {
  mutate {
    merge => {
      "event.idm.read_only_udm.security_result.action" => "ALLOW"
    }
  }
}
else {
  mutate {
    merge => {
      "event.idm.read_only_udm.security_result.action" => "UNKNOWN_ACTION"
    }
  }
}

 

2. List Membership (in) 

 

This functionality allows for an efficient determination of whether a value is present within a pre-defined list or array. This is a significantly cleaner and more scalable alternative to composing multiple or conditions.

 

# Normalize protocol only if it matches known standard protocols
if [protocol] in ["TCP", "UDP", "ICMP"] {
  mutate {
    lowercase => [ "protocol" ]
  }
}

 

3. Safety Checks (Not Empty)

To ensure robust parsing stability, it is an essential best practice to confirm the existence and non-empty status of a field prior to initiating resource-intensive or complex operations, such as Base64 decoding or data merging.

 

# Only attempt to decode if the ip_address token is NOT empty
if [ip_address] != "" {
  base64 {
    source => "ip_address"
    target => "ip_address_string"
  }
}

 

Looping

 

The Syntax:

 

The for loop construct within the Google SecOps environment is employed for iterating over arrays (lists) or JSON objects. This functionality is absolutely essential when a singular log entry encompasses multiple values for a specific field, such as a list of email recipients, numerous IP addresses, or dynamic tags.

 

There are three primary variations of the for loop:

 

1. Basic Loop over Array

 

This function iterates sequentially through every item contained within a designated list.

 

for item in array_variable {
  mutate {
    merge => { "target_field" => "item" }
  }
}

 

2. Loop over Array with Index

 

This iterates through the elements of the list while concurrently tracking the numerical position (index) of the item (0, 1, 2, and so forth).

 

for index, item in array_variable {
  # Logic using both index and item
}

 

3. Loop over Key-Value Pairs (Map)

 

This is utilized for iterating through the inherent properties of a JSON object. This specific mechanism mandates the explicit inclusion of the map keyword.

 

for key, value in object_variable map {
  # Logic using key and value strings
}

 

Rationale for Necessity: Standard replace or rename functions are explicitly designed to handle single, scalar values. Conversely, many contemporary log formats (particularly JSON originating from cloud infrastructure providers) inherently contain arrays. For example, a log may contain {"tags": ["production", "web-server", "critical"]}. To successfully map all three tags to the UDM attribute.labels field, one cannot merely copy the array directly; it is mandatory to iterate through the array and rigorously merge each constituent item individually.

 

Prerequisite: split_columns. To enable successful iteration over a JSON array, it is often critical to employ the array_function => "split_columns" argument during the initial JSON extraction process. This step is what makes the discrete elements of the array fully accessible to the subsequent loop logic.

 

Examples

 

1. Looping over a Simple List

 

The Raw Log: Consider a log entry where the session object contains an ip_address field. This field is an array of strings, representing the various IPs a user connected from during the session.

 

{
  "session": {
    "session_id": "xuvcrjjjUk",
    "ip_address": [
      "194.206.109.196",
      "203.0.113.45",
      "198.51.100.12"
    ]
  }
}

 

The Code: To iterate through this list, extract the JSON first. It is mandatory to incorporate the array_function => "split_columns" argument within the extraction filter. This function segments the array elements, making them fully accessible to the subsequent loop logic.

 

filter {
  json {
    source => "message"
    array_function => "split_columns"
    on_error => "json_failure"
  }

  # Iterate over every IP in the session.ip_address array
  for ip in session.ip_address {
    mutate {
      # Merge appends each IP string to the UDM principal.ip list
      merge => {
        "event.idm.read_only_udm.principal.ip" => "ip"
      }
    }
  }
}

 

The Output: The merge function, positioned inside the loop, ensures that every single IP address within the array is appended to the designated UDM field, preventing overwriting. The resulting normalized UDM structure is as follows:

 

"event": {

"idm": {

"read_only_udm": {

"principal": {

"ip": [

"194.206.109.196",

"203.0.113.45",

"198.51.100.12"

]

}

}

}

}

 

2. Looping with an Index

 

When it becomes necessary to precisely label the data based on its ordinal sequence (e.g., "phoneNumber 0", "phoneNumber 1"), the index syntax must be explicitly employed.

 

for index, phoneNumber in businessPhones {
  mutate {
    # Convert the index to a string so it can be used in a string field
    convert => { "index" => "string" }
  }
  mutate {
    replace => {
      # Create the temporary field for merging (this also clears the field on each iteration)
      "temp_label" => ""
    }
  }
  mutate {
    replace => {
      # Construct a key-value pair using the index
      "temp_label.key" => "phoneNumber %{index}"
      "temp_label.value" => "%{phoneNumber}"
    }
  }
  mutate {
    merge => {
      "event.idm.read_only_udm.principal.resource.attribute.labels" => "temp_label"
    }
  }
}

 

Result: The resultant labels will manifest as key: "phoneNumber 0", value: "(123) 234-2320", precisely reflecting the data's position.

 

3. Looping over Maps (Dynamic Objects)

 

Occasionally, you will encounter a JSON object where the field names are not known beforehand, such as {"labels": {"env": "dev", "owner": "admin"}}. To handle this structure, you must utilize the map keyword to extract both the dynamic key and its corresponding value.

 

for key, value in resource.labels map {
  mutate {
    replace => {
      "temp_label.key" => "%{key}"
      "temp_label.value" => "%{value}"
    }
  }
  mutate {
    merge => {
      "event.idm.read_only_udm.principal.resource.attribute.labels" => "temp_label"
    }
  }
}

 

Result: This process guarantees the capture of every key-value pair within the object without the necessity of hardcoding individual field names.

 

4. Nested Loops

 

In scenarios involving complex logs with arrays nested inside other arrays (e.g., a list of resources, where each resource subsequently possesses a list of IP addresses), the use of nested for loops is the prescribed method.

 

for item in resourceIdentifiers {
  # Logic for the outer item
  for sub_item in item.subnets {
    # Logic for the inner item
  }
}

 

This nested approach permits the flattening of deeply nested data structures to align them with the UDM schema.

 

Drop

 

The Syntax:

 

if [token] == "value" {
  drop {
    # The tag property is optional
    tag => "TAG_REASON"
  }
}

 

tag is an optional property that assigns a label to the discarded message. This is highly useful for tracking ingestion metrics or for debugging precisely why certain logs were rejected from the pipeline.

 

Why this is necessary: The Drop function executes a comprehensive halt on all further processing of a log message. When an event encounters a drop filter, it is summarily discarded and is explicitly prevented from proceeding to the "Data Assignment" stage or final ingestion. This mechanism is absolutely critical for:

 

  1. Noise Reduction: Filtering out high-volume, low-value logs (e.g., routine successful load balancer health checks or verbose debug messages) that needlessly consume valuable quota without contributing to effective security detection (see the sketch after this list).
  2. Data Quality: Systematically discarding logs that are demonstrably malformed or critically deficient, lacking essential fields (such as a valid timestamp or IP address) that are mandated for valid UDM record creation.
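
As a minimal sketch of the noise-reduction case, the snippet below drops successful load balancer health checks. The request_path and status tokens and the TAG_HEALTH_CHECK label are hypothetical and would need to match your own log format:

filter {
  # Hypothetical example: discard routine, successful health check requests
  if [request_path] == "/healthz" and [status] == "200" {
    drop {
      tag => "TAG_HEALTH_CHECK"
    }
  }
}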

 

Important Note: The drop function is virtually always encapsulated within Conditional Logic (if statements). Should a drop {} filter be inadvertently placed at the top level of your parser without a governing condition, it will ruthlessly discard every single log event it encounters. Additionally, if the drop filter executes, you will not be able to utilize a parser extension, as the entire normalization process is cancelled.

 

Examples

 

1. Basic Filtering (Dropping Placeholder Data) 

 

Certain log formats employ a dash (-) to explicitly indicate that a specific field is empty or irrelevant. If your established security methodology determines that logs containing a specified empty field are functionally useless, they should be immediately dropped.

 

filter {
  # If the domain token contains only a dash, discard the log
  if [domain] == "-" {
    drop {}
  }
}

 

Result: The log entry is irrevocably removed from the data pipeline and is deliberately not committed to storage in Google SecOps.

 

2. Dropping with a Tag (Error Handling)

 

The best practice dictates that logs destined for dropping should be tagged. This proactive measure enables monitoring and metric tracking of how many logs are being systematically discarded, and, crucially, illuminates the precise rationale behind their rejection. This is routinely paired with on_error flags generated by preceding filters.

 

filter {
  # Attempt to coerce a value to an integer
  mutate {
    convert => {
      "port" => "integer"
    }
    on_error => "port_conversion_failed"
  }

  # If the conversion failed, the log is likely malformed. Drop and tag it.
  if [port_conversion_failed] {
    drop {
      tag => "TAG_MALFORMED_PORT"
    }
  }
}

 

Result: The malformed log is fully discarded, yet the ingestion system retains a record that it was rejected with the explicit tag TAG_MALFORMED_PORT.

 

To monitor dropped logs, you can use the Ingestion metrics schema in Dashboards; see the documentation for details.

 

Conclusion

 

Mastering GoStash is the definitive step in transforming raw, unstructured log data into the structured, actionable intelligence required by the Unified Data Model (UDM). By systematically applying the three main categories—Extraction, Manipulation, and Assignment—you gain total control over the ingestion pipeline, ensuring that diverse vendor logs are normalized into a unified schema.

From the seamless efficiency of the JSON filter to the robust, meticulous parsing capabilities of Grok for unstructured data, you now possess the methodology to handle any log format. Beyond simple field mapping, the advanced application of Conditionals and Looping allows for the handling of sophisticated, multi-value arrays and dynamic objects. Furthermore, understanding the critical distinction between replace for scalar values and merge for repeated fields is paramount for avoiding schema violations and ensuring a healthy UDM record.

 

Recommended Next Steps

 

To ensure a robust parsing strategy within Google SecOps, adhere to the following practical path forward:

  1. Prioritize Structure: Always attempt to use structured filters like JSON, KV, or XML first, as they are self-describing and efficient. Reserve the use of Grok for unstructured text where no other option exists.
  2. Enforce Data Typing: Rigorously utilize the Convert function to transform string data into the strict integers, floats, and IP addresses mandated by the UDM schema.
  3. Implement Safety Checks: Use Conditionals to verify fields exist before operations like Base64 decoding, and utilize the Drop function to systematically filter out noise and malformed logs to preserve quota and data quality.
  4. Commit the Event: Remember that the parsing process is only complete when you perform the mandatory Merge of your event object into the @output field, as shown in the sketch below.
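
As a minimal sketch of that final step, assuming your UDM data has been assembled in a token named event (as in the examples throughout this guide), the commit looks like this:

mutate {
  # Merge the assembled event object into @output so the normalized record is emitted
  merge => {
    "@output" => "event"
  }
}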

Properly architected GoStash code turns a flood of disparate logs into a harmonized dataset, ready for detection and response. With these tools and logic constructs at your disposal, you are now fully equipped to become a parsing champion.