Author: Hafez Mohamed
Deep Dive into UDM Parsing – How I learned to stop worrying and love the "log"
This guide explores UDM (Unified Data Model) parsing, focusing on transforming JSON logs into the Google UDM schema. We'll cover log schema analysis, custom parser creation, and leveraging Jinja templates for efficient development.
Problem Formulation
Log parsing involves transforming various log formats (free-text, XML, CSV, JSON) into UDM's structured JSON format for Events and Entities. UDM requires specific field types and nested structures, beyond simple text placeholders. In short, UDM is a schema!
Prologue
We have different types of logs incoming to the SIEM in different formats:
a. Free-text
b. XML
c. CSV
d. JSON
These formats are eventually transformed into one of two JSON formats of the Unified Data Model (UDM) https://cloud.google.com/chronicle/docs/reference/udm-field-list, one for Events and one for Entities, through a series of Gostash https://cloud.google.com/chronicle/docs/reference/parser-syntax statements (Google's Logstash variant) within the log parsing process.
The UDM Events are what you see as raw attributes in the SIEM UDM search describing the event logs, while the Entities are what you see as enriched attributes in the UDM search and as entity objects in the Entities Explorer.
We are going to cover both use cases.
Parser Design Workflow
This is a detailed high-level workflow for the parser design process.
- Identify Log Type and Key Fields: Determine the log source and its essential fields (e.g., timestamp, severity, user).
- Analyze Schema and Format: Examine the log structure (JSON, free-text, etc.), identify data types, and note recurring/permanent fields.
- Define Pre-Tokenization Conditions: Specify conditions for parsing (e.g., only logs with WARN severity or higher).
- Tokenize Required Fields: Extract the identified fields using Gostash (Google's Logstash).
- Apply Post-Tokenization Conditions: Add attributes based on extracted data (e.g., "internal" for corporate users).
- Map to UDM: Assign extracted tokens to the corresponding UDM schema fields.
- Validate the Parser: Test the parser using the SIEM's validation feature (1k logs preferred) or the ingestion API https://cloud.google.com/chronicle/docs/reference/ingestion-api or CBN tool https://github.com/chronicle/cbn-tool for smaller samples.
- Document and Backup: Maintain a version-controlled backup of your parsers for tracking changes and rollback capabilities.
- Monitor Performance: Regularly monitor parser performance, especially after vendor updates that might alter log formats.
UI Navigation
1. Creating a new parser: Go to Settings > SIEM Settings > Parser Settings > Add Parser > select "UDM" (or the ingestion label of your data source; if it is not present, raise a case with Support to add it) > Create.
Approach
This series uses a practical, use-case-driven approach, focusing on common JSON log parsing scenarios and regular expressions. CSV parsing is straightforward, while key-value pairs require different techniques. The process involves tokenization (capturing data) and mapping (assigning to UDM).
Useful Tools
- LLMs: Generate sample logs for testing and practice.
- Jinja Templates: Simplify parser development by using Jinja's text generation capabilities. (Note: This series won't be a full Jinja tutorial.)
- JSONPath: Understanding JSONPath can aid in clarifying UDM syntax (optional).
- JSON Tree Viewers: Visualize JSON log structures for schema understanding.
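As a taste of what Jinja brings, here is a minimal sketch (Python with the third-party jinja2 package; the token-to-UDM pairs and UDM paths below are illustrative assumptions, not an official mapping) that generates repetitive Gostash mapping statements from a template:

from jinja2 import Environment

# Use [[ ]] as the Jinja variable delimiters so the literal { } braces of the
# Gostash syntax pass through the template untouched.
env = Environment(variable_start_string="[[", variable_end_string="]]")
tmpl = env.from_string('mutate { replace => { "[[ udm ]]" => "%{[[ field ]]}" } }')

# Hypothetical token -> UDM field pairs, for illustration only.
mappings = [
    ("user.username", "event.idm.read_only_udm.principal.user.userid"),
    ("system.hostname", "event.idm.read_only_udm.principal.hostname"),
]
for field, udm in mappings:
    print(tmpl.render(field=field, udm=udm))
# mutate { replace => { "event.idm.read_only_udm.principal.user.userid" => "%{user.username}" } }
# mutate { replace => { "event.idm.read_only_udm.principal.hostname" => "%{system.hostname}" } }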
Tokenization
Tokenization is the process of capturing fields from the log message and assigning them to a new field name (the target token).
We will be using GenAI to generate a sample log message; this is the sample used throughout this guide:
{
  "timestamp": "2025-01-11T12:00:00Z",
  "event_type": "user_activity",
  "user": {
    "id": 12345,
    "username": "johndoe",
    "profile": {
      "email": "john.doe@example.com",
      "location": "New York",
      "VIP": true
    },
    "sessions": [
      {
        "session_id": "abc-123",
        "start_time": "2025-01-11T11:30:00Z",
        "actions": [
          {"action_type": "login", "timestamp": "2025-01-11T11:30:00Z", "targetIP": "10.0.0.10"},
          {"action_type": "search", "query": "weather", "timestamp": "2025-01-11T11:35:00Z", "targetIP": "10.0.0.10"},
          {"action_type": "logout", "timestamp": "2025-01-11T11:45:00Z", "targetIP": "10.0.0.11"}
        ]
      },
      {
        "session_id": "def-456",
        "start_time": "2025-01-11T12:00:00Z",
        "actions": [
          {"action_type": "login", "timestamp": "2025-01-11T12:00:00Z", "targetIP": "192.168.1.10"}
        ]
      }
    ]
  },
  "system": {
    "hostname": "server-001",
    "ip_address": "192.168.1.100"
  },
  "Tags": ["login_logs", "dev"]
}
It is very useful to view JSON logs as a tree structure. There are multiple tools available, including VSCode; in this guide I will be using the online viewer at https://jsonviewer.stack.hu/ . Make sure you use sanitized logs or internally approved JSON viewer tools.
(Figure: the JSON tree for the log sample above.)
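If an online viewer is not an option, a few lines of Python can produce a comparable tree offline. A minimal sketch, assuming the sample log is saved locally as sample_log.json (an assumed filename):

import json

def print_tree(node, indent=0):
    # Print one node per line, indented by depth; list elements show their indexes.
    pad = "  " * indent
    if isinstance(node, dict):
        for key, value in node.items():
            print(f"{pad}{key}")
            print_tree(value, indent + 1)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            print(f"{pad}[{i}]")
            print_tree(item, indent + 1)
    else:
        print(f"{pad}{node!r}")

with open("sample_log.json") as fh:
    print_tree(json.load(fh))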
Schema Identification
To parse logs correctly and efficiently, first define each field's type; schema inference tools streamline this, especially for large datasets. JSON logs are generally represented in Python as dictionaries whose keys are the field names. For the example above, we identify the types of several data fields:
Field Name | Data Type | Multiplicity | Hierarchy
event_type | Primitive - String | Single-valued | Flat
timestamp | Primitive - Date | Single-valued | Flat
system {} | Composite (object/dictionary) | Single-valued | Nested
user {} | Composite (object/dictionary) | Single-valued | Nested
user.profile.VIP | Primitive - Boolean | Single-valued | Flat
user.id | Primitive - Integer | Single-valued | Flat
Tags [] | Primitive - String - Repeated | Multi-valued (repeated, list) | Flat
user.sessions [] | Composite (dictionary) - Repeated | Multi-valued (repeated, list) | Nested
user.sessions[].actions[].action_type | Primitive - String | Single-valued (mandatory) | Flat
user.sessions[].actions[].query | Primitive - String | Single-valued (optional) | Flat

Notes:
- timestamp: Flat because the field sits at the topmost level of the JSON log, i.e. it is not nested under other fields. All date fields are strings that follow a particular date format; this one follows ISO 8601: 2025-01-11 is the date in YYYY-MM-DD format, T separates date and time, 12:00:00 is the time in HH:MM:SS (24-hour format), and Z is Zulu time (UTC, Coordinated Universal Time).
- system {}: Composite because its value holds two other fields, and these fields have a hierarchy, making the JSON a tree structure. The curly brackets next to the field name indicate a composite field. To access the composing nodes we use JSON dot notation, i.e. "system.hostname" and "system.ip_address" access the subfield nodes of "system".
- user {}: Composite because, like "system", it is made up of several fields underneath. Nested values are values that contain other JSON fields ({"id": ..., "username": ..., ...}) in a hierarchy; in Python this is analogous to a key whose value is a dictionary.
- user.profile.VIP: Boolean fields are strictly lowercase true or false without any quotes; they are not strings, as they have no enclosing double quotes "". This field is reached through its parents ("user", then "profile") but has no subfield nodes of its own, so it is Flat in terms of hierarchy. As you can see, accessing a nested hierarchy in JSON uses dot notation to express the parent/subfield relationship: "user.profile.VIP" is the path to "VIP".
- user.id: Integer fields are digits without double quotes and are handled differently from string fields, i.e. "12345" != 12345; the left is a string of digits while the right is an integer. The field is Flat because it has no subfield nodes, even though it has the parent node "user".
- Tags []: Flat because it contains no nested JSON fields; notice that the value of "Tags" is a list, not a dictionary (unlike "system"). Each element of the list is a string, so its data type is string but repeated, making this a Repeated-Atomic field. The square brackets "[ ]" next to the field name indicate a repeated rather than composite field. Repeated fields are accessed slightly differently than composite fields: the second tag is "tags[1]", not "tags.1".
- user.sessions []: The most complex field type; it has all the sub-types discussed. Repeated because it is composed of two session objects (0 and 1, both branching from the bracket next to the "sessions" field name), and each element in the list is a dictionary (object), so its data type is dictionary-repeated, or Repeated-Composite. The field is Nested because each element in the list has nested fields like "user.sessions[0].actions[1]".
- action_type: Mandatory fields appear in ALL of their parent's instances, i.e. every "actions" object has an "action_type".
- query: Optional fields appear in SOME of their parent's instances, i.e. not every "actions" object has a "query"; only the "search" actions will have it.
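A small script can approximate this classification automatically. A rough sketch in plain Python, operating on the sample saved as sample_log.json (an assumed filename); note it cannot distinguish date strings from plain strings:

import json

def classify(value):
    # Return (data type, multiplicity, hierarchy) for one JSON value.
    if isinstance(value, bool):   # check bool before int: True is an int in Python
        return ("Primitive - Boolean", "Single-valued", "Flat")
    if isinstance(value, int):
        return ("Primitive - Integer", "Single-valued", "Flat")
    if isinstance(value, dict):
        return ("Composite (dictionary)", "Single-valued", "Nested")
    if isinstance(value, list):
        composite = bool(value) and isinstance(value[0], dict)
        kind = "Composite - Repeated" if composite else "Primitive - Repeated"
        return (kind, "Multi-valued", "Nested" if composite else "Flat")
    return ("Primitive - String", "Single-valued", "Flat")

def walk(node, path=""):
    # Print every field's dot-notation path with its classification.
    for key, value in node.items():
        full = f"{path}.{key}" if path else key
        print(full, classify(value))
        if isinstance(value, dict):
            walk(value, full)
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            walk(value[0], full + "[0]")   # classify list elements by the first one

with open("sample_log.json") as fh:
    log = json.load(fh)
walk(log)

# Mandatory vs. optional: intersect the keys of every "actions" instance.
actions = [a for s in log["user"]["sessions"] for a in s["actions"]]
mandatory = set.intersection(*(set(a) for a in actions))
optional = set.union(*(set(a) for a in actions)) - mandatory
print("mandatory:", mandatory, "optional:", optional)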
Exporting Logs from Python Scripts
Tip: When exporting logs from Python scripts, always make sure to:
- Switch single quotes ' to double quotes ".
- Keep True/False lowercase (true/false).
- Remove None/null values, or convert them to the strings "None" or "Null".
Pasting logs with formatting errors (like invalid JSON or unbalanced brackets) into the parser UI will result in an error message. The simplest way to satisfy all of the points above is to serialize with json.dumps instead of printing Python objects, as shown in the sketch below.
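A minimal sketch of the difference, using only the Python standard library; the None-to-string conversion mirrors the tip above:

import json

record = {"user": "johndoe", "VIP": True, "query": None}

# str() gives the Python repr: single quotes, True/None -- not valid JSON.
print(str(record))          # {'user': 'johndoe', 'VIP': True, 'query': None}

# json.dumps() emits valid JSON: double quotes, lowercase true, null.
print(json.dumps(record))   # {"user": "johndoe", "VIP": true, "query": null}

# To follow the tip above and avoid null entirely, stringify None values first.
cleaned = {k: ("None" if v is None else v) for k, v in record.items()}
print(json.dumps(cleaned))  # {"user": "johndoe", "VIP": true, "query": "None"}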
If the logs are in the correct JSON format, they will render cleanly in the parser UI. (Figure: correctly formatted logs as shown in the UI.)
JSON Path Basics
This section covers the fundamentals of JSONPath, a query language for navigating and extracting data from JSON documents. Using a JSONPath evaluator, paste sample log messages and construct queries to select specific JSON fields. This will help you visualize JSON schemas and understand the distinctions between JSONPath and the Google Gostash parser syntax.
In Gostash, the first steps in any JSON parser are generally:
- Use the JSON parse clause, serializing array fields (more on that later; see the sketch after this list):
filter {
  json { source => "message" array_function => "split_columns" }
  statedump {}
}
- Gostash uses distinct syntaxes for field referencing, determined by the operator (e.g., assignment, conditional, loop, merge). This section will delve into the 'replace' operator, loops, and if-conditionals, given their prevalence; the 'merge' operator will be addressed in a dedicated section.
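For intuition only, here is a hypothetical Python illustration of what serializing array fields gives you: list elements become index-addressable children, which matches how %{Tags.1} is referenced later in this guide. This is a sketch of the idea, not Chronicle's actual implementation:

def split_columns(node):
    # Turn every list into a dict keyed by element index, recursively.
    if isinstance(node, list):
        return {str(i): split_columns(v) for i, v in enumerate(node)}
    if isinstance(node, dict):
        return {k: split_columns(v) for k, v in node.items()}
    return node

print(split_columns({"Tags": ["login_logs", "dev"]}))
# {'Tags': {'0': 'login_logs', '1': 'dev'}}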
There are lots of JSON Path evaluators, we will be using https://jsonpath.com/ developed by “Hamasaki Kazuki” in this guide.
Select the Whole JSON
JSONPath: $

Note: JSONPath returns a list by default, i.e. querying a log message with $ returns [ ...<log message content>... ], but for the sake of simplicity we will assume it returns just the log message (without the list) when comparing Gostash and JSONPath.
For example, selecting a whole JSON log such as {"user": "Alex"} with "$" actually returns [{"user": "Alex"}], not {"user": "Alex"}; for simplicity we will ignore this in the following sections.
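You can verify this list-wrapping behaviour programmatically. A minimal sketch, assuming the third-party jsonpath-ng package (pip install jsonpath-ng):

from jsonpath_ng import parse

log = {"user": "Alex"}

matches = [m.value for m in parse("$").find(log)]
print(matches)      # [{'user': 'Alex'}]  <- a list wrapping the log
print(matches[0])   # {'user': 'Alex'}   <- the log itself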
Select a Simple Field
JSONPath: $.event_type

The first operator $ selects the root JSON log, the dot operator "." moves the selection one level down to the subfield nodes, and "event_type" picks the subfield node "event_type". In effect, this selects the value of the event_type field, which is "user_activity". In Gostash, to reference the same field:
filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "myVariable" => "%{event_type}" }} }
filter { json { source => "message" array_function => "split_columns"} if [event_type] == "abc" { } }
Select a Subfield
JSONPath: $.user.id

Same as above: $ selects the root JSON log, "." moves one level down to the subfield nodes, and "user" picks the subfield node "user"; we then move one step further and reference the second-level field "id". In Gostash, to reference the same field:
I. For non-string fields: Not supported. For example, user.id is an integer, not a string, so
filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "myVariable" => "%{user.id}" }} }
will give an error, as "user.id" is an integer field, not a string field, as we highlighted earlier.
II. For string fields: Supported. E.g. "user.username" is a string field, referenced by %{user.username}:
filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "myVariable" => "%{user.username}" }} }
In conditionals:
filter { json { source => "message" array_function => "split_columns"} if [user][id] == 13 { } }
IF conditionals in Gostash use bracket notation instead of dot notation to reference nested fields, and they support both string and non-string data types.
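For intuition, the same references in plain Python (on a trimmed version of the sample log) make the string-versus-integer distinction obvious:

import json

raw_message = '{"user": {"id": 12345, "username": "johndoe"}}'  # trimmed sample
log = json.loads(raw_message)

print(log["user"]["username"])   # 'johndoe' -- a string, like %{user.username}
print(log["user"]["id"])         # 12345 -- an integer, so %{user.id} fails

# The Python analogue of the workaround is an explicit string conversion.
my_variable = str(log["user"]["id"])   # "12345"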
Select a Repeated Field
JSONPath: $.Tags

The syntax is similar to the above cases, but selecting a single element requires appending a bracketed index: to access the 2nd tag, the JSONPath is $.Tags[1]. In Gostash, to reference the 2nd tag:
Wildcards: Not supported. In Gostash, you cannot use wildcard syntax (like %{Tags.*}) to access all items in a repeated field; you must reference specific elements by their index, such as the first or second tag.
filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "myVariable" => "%{Tags.*}" }} statedump {} }
Indexed references: Supported, e.g. "%{Tags.1}";
filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "myVariable" => "%{Tags.1}" }} statedump {} }
Without "array_function", repeated fields won't be accessible, i.e. %{Tags.1} is not possible if the JSON is parsed without it; "array_function" should always be used to allow accessing repeated fields.
filter { json { source => "message"} mutate { replace => { "myVariable" => "%{Tags.1}" }} statedump {} }
Conditionals: Supported with array_function;
filter { json { source => "message" array_function => "split_columns"} if [Tags][1] == "dev" { statedump {}} }
Without array_function: not supported and will give an error;
filter { json { source => "message" } if [Tags][1] == "dev" { statedump {}} }
Loops:
I. With array_function: Supported;
filter { json { source => "message" array_function => "split_columns"} for index, _tag in Tags { statedump {}} }
The loop will execute two times, corresponding to the two values in the 'Tags' field. Pay attention to the use of 'Tags' without the '%{}' syntax; we'll cover the reason for this in a later section. Gostash's syntax might allow you to write a loop referencing a specific index of a repeated field (e.g., Tags.0), but this will result in a logical error, and the loop's content will be skipped.
filter { json { source => "message" array_function => "split_columns"} for index, _tag in Tags.0 { statedump {}} }
II. Without array_function: the loop won't produce a syntax error, but it will fail to execute. This is a logical error, meaning the loop is structurally incorrect.
filter { json { source => "message" } for index, _tag in Tags { statedump {}} }
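The loop semantics map directly onto plain Python; a quick sketch using the Tags value from the sample log:

tags = ["login_logs", "dev"]   # the "Tags" value from the sample log

# Mirrors: for index, _tag in Tags { ... } -- the body runs once per element.
for index, tag in enumerate(tags):
    print(index, tag)          # 0 login_logs, then 1 dev

# Looping over a single element (the Tags.0 case) gives nothing to iterate:
# a scalar like "login_logs" is not a collection, hence the logical error.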
Select a Composite Field
JSONPath: $.system or $.system.*

This example is a particular distinction between JSONPath and Gostash. In JSONPath you can select the composite field the same way as a simple field using $.system, but it will generate a list of objects:
[ { "hostname": "server-001", "ip_address": "192.168.1.100" } ]
Appending .* instead selects the values of the subfields:
[ "server-001", "192.168.1.100" ]
In Gostash, there is no direct equivalent to $.system, but you can reference individual subfields, test them in conditionals, and loop over each subfield:
filter { json { source => "message"} mutate { replace => { "myVariable" => "%{system.hostname}" }} statedump {} }
filter { json { source => "message"} if [system][hostname] == "server-001" { statedump {} } }
filter { json { source => "message"} for index, systemDetail in system map { statedump {}} }
Note the "map" keyword: this is a distinction from looping through repeated fields, which does not require it. This will be discussed later in more detail.
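Again in plain Python, the composite-field loop corresponds to iterating a dictionary rather than a list, which is the intuition behind the extra "map" keyword:

system = {"hostname": "server-001", "ip_address": "192.168.1.100"}

# Mirrors: for index, systemDetail in system map { ... }
for key, value in system.items():
    print(key, value)            # hostname server-001 / ip_address 192.168.1.100

# JSONPath $.system.* corresponds to just the values:
print(list(system.values()))     # ['server-001', '192.168.1.100']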