Capturing Basic JSON fields
The main body of a GoStash parser is enclosed within a filter {...} statement, like this:
filter {
}
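In practice, the filter block holds the parsing and transformation clauses covered in this section. As a rough illustrative skeleton (not a complete parser), combining the json, mutate, and statedump clauses introduced below:

filter {
  # parse the raw JSON log held in the 'message' field
  json { source => "message" array_function => "split_columns" }
  # create or modify tokens
  mutate { replace => { "constantToken" => "Log Sample" } }
  # print the current parser state for debugging
  statedump {}
}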
In the following section, we will examine common JSON transformation operations available in GoStash. For each operation, we will provide a detailed explanation of its syntax, demonstrate its effect on JSON data using 'statedump' debug output snippets, and include notes to highlight important aspects and potential considerations.
JSON Transformation Pseudocode
In the SecOps environment, incoming log messages are initially stored within a field named 'message'. To standardize and analyze this data, we need to convert it into the Unified Data Model (UDM) schema. To effectively describe the transformation process, we will employ a pseudocode that draws inspiration from the JSONPath syntax you've learned earlier. It's crucial to remember that this pseudocode serves as a descriptive tool for illustrating the transformations and is not a formal syntax that needs to be strictly adhered to.
- To illustrate JSON transformations, we'll employ a pseudocode with clear assignment operations. The arrow symbol (←) will denote assigning a value to a variable or field. For instance, x ← 5 signifies assigning the integer 5 to the variable 'x'. In the context of JSON, $.fixedToken ← "constantString" represents creating a new field named 'fixedToken' directly under the root node ('$') and assigning it the string value 'constantString'.
- Our pseudocode can also express copying values between fields. When copying by reference, we use M[...] to indicate accessing the value of a variable or field. So x ← M[y] means "take the value currently stored in 'y' and assign it to 'x'".
This concept extends to JSON structures. For instance, $.myToken ← $.message.username signifies creating a new field 'myToken' at the root level and assigning it the value found within the 'username' field, which is itself nested inside the 'message' field. This copies only the value, not the entire structure.
For example:
Input Log (JSON):
{ "event_type": "user_activity" }
Transformation:
$.myToken ← $.message.event_type
Output (UDM Schema):
{ ..., "myToken": "user_activity", ... }
- In our pseudocode, composite fields, which are essentially objects containing other fields, are denoted using curly braces {}. This visually represents the nested structure. For example, a field named 'system' with subfields such as 'hostname' and 'ip_address' would be represented as $.system{}. For brevity, we can also simply write $.system, keeping in mind that 'system' has a nested structure.
This distinction between composite and simple fields is particularly important in GoStash. To make it clear in our pseudocode, we'll add {} after composite field names (like $.system{}). This helps you visualize the structure and apply GoStash transformations correctly; you can drop this notation once you become more fluent.
- To access a field nested within a composite field, you can use the dot notation, which is a common convention in JSONPath. For instance, to reference the 'hostname' field within the 'system' field, you would use $.system.hostname. Alternatively, you can explicitly denote the composite field using {}, like this: $.system{}.hostname. Both notations achieve the same result; the latter simply provides a visual cue that system is a composite field containing other fields.
- To represent a repeated field (a field containing a list of values), we'll use square brackets []. For example, the field 'Tags' with multiple values would be shown as $.Tags[] or $.Tags[*].
- For fields containing a list of composite values (objects), we combine {} and []. For example, a 'sessions' field containing a list of session objects, each with 'session_id', 'start_time', and 'actions', would be shown as $.user.sessions[*] or, more explicitly, $.user{}.sessions{}[*]. (A sample log using this notation is sketched just after this list.)
JSON Flattening in Go-Stash
- Flattening is a common operation in JSON parsing that converts a repeated field, which is essentially an array, into a composite field with numbered keys. This transformation makes it easier to access individual elements within the repeated field. For example, if you have a repeated field $.Tags[*] with the values ["login_logs", "dev"], flattening it would result in $.Tags{} with the structure {"0": "login_logs", "1": "dev"}. After flattening, you can reference the first tag using $.Tags.0, the second tag using $.Tags.1, and so on. This provides a convenient way to work with individual elements within a previously repeated field.
- As we describe transformations, our pseudocode will use a flexible approach to referencing fields. Sometimes, it will directly reference fields within the original input log structure, which is always nested under the 'message' field (e.g., $.message.somefield). Other times, the pseudocode might reference fields that have already been created or modified within the target schema during earlier transformation steps (e.g., $.some_new_field). Both approaches are valid and serve to illustrate the flow of data transformation.
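For example, the following fragments describe the same capture; which one applies depends on whether the json parse clause has already promoted the log out of 'message' (field names are the ones used throughout this section):

Referencing the original input structure (before the json clause):
$.myEventClass ← $.message{}.event_type

Referencing the parsed schema (after the json clause):
$ ← flatten($.message{})
$.myEventClass ← $.event_type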
Print Tokens Captured

filter { statedump {} }

Snippet from statedump output:
{ "@createTimestamp": { "nanos": 0, "seconds": 1736558676 }, "@enableCbnForLoop": true, "@onErrorCount": 0, "@output": [], "@timezone": "", "message": "{\n \"timestamp\": \"2025-01-11T12:00:00Z\", \n \".....
Parse the JSON Schema and Flatten Repeated Fields

filter { json { source => "message" array_function => "split_columns"} statedump {} }

Snippet from statedump output:
{ "@createTimestamp": { "nanos": 0, "seconds": 1736558735 }, "@enableCbnForLoop": true, "@onErrorCount": 0, "@output": [], "@timezone": "", "event_type": "user_activity", "message": "{\n ….. "system": { "hostname": "server-001", "ip_address": "192.168.1.100" }, "timestamp": "2025-01-11T12:00:00Z", "user": {.....

Without the json parse clause, the data stays nested under the 'message' field:
{ "message": { ...your log data... } }
so "event_type" is a subfield of the "message" field. After the json clause runs, "event_type" and the other top-level fields appear as tokens at the root of the parser state, as the statedump snippet above shows.

The effect of flattening on the repeated sessions field:
Before:
"user": { "sessions": [ { "session_id": "abc-123",... }, { "session_id": "def-456",... } ] }
After:
"user": { "sessions": { "0": { "session_id": "abc-123",... }, "1": { "session_id": "def-456",... } } }
This makes it much easier to access individual session objects. For example, you can now directly reference the first session as $.user.sessions.0.

The effect of flattening on the repeated Tags field:
Before:
"Tags": ["login_logs", "dev"]
After:
"Tags": { "0": "login_logs", "1": "dev" }
Now you can easily access specific tags. For example, $.Tags.0 refers to "login_logs" and $.Tags.1 refers to "dev". This conversion allows using dot notation to access repeated fields the same way as the subfields of composite fields.
Tips
- Always use "array_function" in the json parse clause, whether you need to parse repeated fields or not:
filter {
  json { source => "message" array_function => "split_columns" }
  for index, _tag in Tags map {
    statedump {}
  }
}
Assign a String Constant

Task: Assign the string "Log Sample" to a new token (variable) "constantToken".

filter { mutate { replace => { "constantToken" => "Log Sample" }} statedump {} }

Snippet from statedump output:
…. "constantToken": "Log Sample", ….
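Equivalent to (in the pseudocode introduced above) a simple constant assignment:
$.constantToken ← "Log Sample"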
Capture a Simple String Token

Task: Capture the value of $.event_type into a new token "myEventClass".

filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "myEventClass" => "%{event_type}" }} statedump {} }

Snippet from statedump output:
… "myEventClass": "user_activity", ….

Equivalent to:
$ ← flatten($.message{})
$.myEventClass ← $.event_type

It's important to be aware of which schema you're referencing. If you're still working with the original input structure, the equivalent of the second line would be $.myEventClass ← $.message{}.event_type, because the event_type field would still be nested under message. This effectively creates a copy of the 'event_type' field's value under a new name ('myEventClass').
Capture a Subfield String Token

Task: Capture the value of $.user{}.username into a new token "myUser".

filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "myUser" => "%{user.username}" }} statedump {} }

Snippet from statedump output:
… "myUser": "johndoe" ….

Equivalent to:
$ ← flatten($.message{})
$.myUser ← $.user{}.username
This assigns the value from the nested 'username' field (within 'user') to a new token called 'myUser'.
Capture a Repeated String Field with Specific Order

Task: Capture the value of the first session ID $.user{}.sessions{}[0].session_id and the second tag $.Tags[1].

filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "firstSession" => "%{user.sessions.0.session_id}" }} mutate { replace => { "secondTag" => "%{Tags.1}" }} statedump {} }

Snippet from statedump output:
… "firstSession": "abc-123", ….

Equivalent to:
$ ← flatten($.message{})
$.firstSession ← $.user{}.sessions{}.0.session_id
$.secondTag ← $.Tags{}.1

The array_function in GoStash introduces a key difference in how you work with repeated fields compared to JSONPath. This flattening behavior allows GoStash to treat repeated fields similarly to composite fields, providing a consistent way to access data using dot notation. I.e., %{Tags.1} is not supported without the "array_function" clause.
Task: Capture the first action type under the first action in the first session, $.user{}.sessions{}[0].actions{}[0].action_type, into "firstActionType".

filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "firstActionType" => "%{user.sessions.0.actions.0.action_type}" }} statedump {} }

Snippet from statedump output:
… "event_type": "user_activity", ….

Equivalent to:
$ ← flatten($.message{})
$.firstActionType ← $.user.sessions.0.actions.0.action_type
Initialize an Empty Field

Task: Declare and clear an empty token "emptyPlaceholder".

filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "emptyPlaceholder" => "" }} statedump {} }

Snippet from statedump output:
… "emptyPlaceholder": "", ….

Equivalent to:
$ ← flatten($.message{})
$.emptyPlaceholder ← ""   (declare an empty string variable)

Some operations expect the destination token to already exist before it is referenced. In these cases, initializing the token ensures that it exists and is ready to be used in the intended operation. We'll explore these scenarios in more detail in the following sections.
Capture a Field (If It Exists) with Exception Handling

Task: Capture the third Tag if it exists, without raising any parsing errors if it does not.

filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "my3rdTag" => "%{Tags.2}"} on_error => "3rdTag_isAbsent"} statedump {} }

Snippet from statedump output:
… "3rdTag_isAbsent": true, …

Equivalent to:
$ ← flatten($.message{})
$.my3rdTag ← $.Tags{}.2, ifError raiseFlag: $.3rdTag_isAbsent ← true

In this example there is no third tag, so using the "mutate { replace => { "my3rdTag" => "%{Tags.2}"}}" statement without "on_error" would generate a compile error and halt the parser execution. With on_error, GoStash instead raises a flag named '3rdTag_isAbsent' and sets it to true, allowing the parser to continue executing. This demonstrates how on_error prevents complete parser failure and provides more robust error handling.
Capture a Non-String Subfield

Task: Capture the $.user{}.id integer field.

filter { json { source => "message" array_function => "split_columns"} mutate {convert => {"user.id" => "string"} on_error => "userId_conversionError"} mutate { replace => { "myUserId" => "%{user.id}" }} mutate {convert => {"myUserId" => "integer"}} statedump {} }

Snippet from statedump output:
… "myUserId": 12345, "userId_conversionError": false ….

Equivalent to:
$ ← flatten($.message{})
$.user{}.id ← string($.user{}.id), ifError raiseFlag: $.userId_conversionError ← true
$.myUserId ← $.user{}.id
$.myUserId ← integer($.myUserId)

The replace function substitutes string values via %{...}, while convert changes a token's data type; that is why the integer field is first converted to a string, captured, and then converted back. Keep this distinction in mind to avoid syntax errors when using these operators.
Sample Token Types

Assume an input message processed by the parser below:
{ "booleanField": true, "booleanField2": 1, "floatField": 14.01, "integerField": -3, "uintegerField": 5, "stringField": "any single-line string is here"}

filter { json { source => "message"} statedump {} }

Snippet from statedump output:
{ "booleanField": true, "booleanField2": 1, "floatField": 14.01, "integerField": -3, "stringField": "any single-line string is here", "uintegerField": 5 }

Available token data types (e.g., as targets of the convert function): boolean, float, hash, integer, ipaddress, macaddress, string, uinteger, hextodec, hextoascii.
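As a hedged illustration of how these types interact with the capture pattern shown earlier, the following sketch (the token names "myFloat" and "float_conversionError" are illustrative, not from the lab) captures the non-string floatField by converting it to a string, copying it, and converting the copy back to a float:

filter {
  json { source => "message" }
  # floatField arrives as a float; convert it to a string so %{...} substitution can read it
  mutate { convert => { "floatField" => "string" } on_error => "float_conversionError" }
  # copy the (now string) value into a new token
  mutate { replace => { "myFloat" => "%{floatField}" } }
  # restore the numeric type on the new token
  mutate { convert => { "myFloat" => "float" } }
  statedump {}
}

This mirrors the convert-capture-convert pattern used for $.user{}.id above.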
Constructing Nested String Elements

Task: Capture $.event_type and $.timestamp into the hierarchical tokens "grandparent.parent.eventType" and "othergrandparent.otherparent.date".

filter {
  json { source => "message" array_function => "split_columns"}
  mutate { replace => { "grandparent.parent.eventType" => "%{event_type}" }}
  mutate { replace => { "myTimestamp" => "%{timestamp}" }}
  mutate { rename => { "myTimestamp" => "othergrandparent.otherparent.date" }}
  statedump {}
}

Snippet from statedump output:
… "grandparent": { "parent": { "eventType": "user_activity" } }, "othergrandparent": { "otherparent": { "date": "2025-01-11T12:00:00Z" } }, …

Equivalent to:
$ ← flatten($.message{})
$.grandparent{}.parent{}.eventType ← $.event_type
$.myTimestamp ← $.timestamp
Rename $.myTimestamp ⇒ $.othergrandparent.otherparent.date
Next Step: Security Operations: Deep Dive into UDM Parsing - Part 1.3