Author: Hafez Mohamed
Deep Dive into UDM Parsing – How I learned to stop worrying and love the "log"
This guide explores UDM (Unified Data Model) parsing, focusing on transforming JSON logs into the Google UDM schema. We'll cover log schema analysis, custom parser creation, and leveraging Jinja templates for efficient development.
Problem Formulation
Log parsing involves transforming various log formats (free-text, XML, CSV, JSON) into UDM's structured JSON format for Events and Entities. UDM requires specific field types and nested structures, beyond simple text placeholders. In short, UDM is a schema!
Prologue
We have different types of logs incoming to the SIEM in different formats:
a. Free-text
b. XML
c. CSV
d. JSON
These formats are eventually transformed into one of two JSON formats of the Unified Data Model (UDM) https://cloud.google.com/chronicle/docs/reference/udm-field-list, one for Events and one for Entities, through a series of Gostash https://cloud.google.com/chronicle/docs/reference/parser-syntax statements (Google's Logstash variant) within the log parsing process.
The UDM Events are what you see as raw attributes in the SIEM UDM search describing the event logs, while the Entities are what you see as enriched attributes in the UDM search and as entity objects in the Entities Explorer.
We are going to cover both use cases.
Parser Design Workflow
This is a detailed high-level workflow for the parser design process.
- Identify Log Type and Key Fields: Determine the log source and its essential fields (e.g., timestamp, severity, user).
- Analyze Schema and Format: Examine the log structure (JSON, free-text, etc.), identify data types, and note recurring/permanent fields.
- Define Pre-Tokenization Conditions: Specify conditions for parsing (e.g., only logs with WARN severity or higher).
- Tokenize Required Fields: Extract the identified fields using Gostash (Google's Logstash).
- Apply Post-Tokenization Conditions: Add attributes based on extracted data (e.g., "internal" for corporate users).
- Map to UDM: Assign extracted tokens to the corresponding UDM schema fields.
- Validate the Parser: Test the parser using the SIEM's validation feature (1k logs preferred) or the ingestion API https://cloud.google.com/chronicle/docs/reference/ingestion-api or CBN tool https://github.com/chronicle/cbn-tool for smaller samples.
- Document and Backup: Maintain a version-controlled backup of your parsers for tracking changes and rollback capabilities.
- Monitor Performance: Regularly monitor parser performance, especially after vendor updates that might alter log formats.
UI Navigation
1. Creating a new parser: Go to Settings > SIEM Settings > Parser Settings > Add Parser > select "UDM" (or the ingestion label of your data source; if it is not present, raise a case with Support to add it) > Create.
Approach
This series uses a practical, use-case-driven approach, focusing on common JSON log parsing scenarios and regular expressions. CSV parsing is straightforward, while key-value pairs require different techniques. The process involves tokenization (capturing data) and mapping (assigning to UDM).
Useful Tools
- LLMs: Generate sample logs for testing and practice.
- Jinja Templates: Simplify parser development by using Jinja's text generation capabilities. (Note: This series won't be a full Jinja tutorial.)
- JSONPath: Understanding JSONPath can aid in clarifying UDM syntax (optional).
- JSON Tree Viewers: Visualize JSON log structures for schema understanding.
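As a taste of what Jinja brings, here is a minimal sketch (Python with the third-party jinja2 package; the token-to-UDM pairs and UDM paths below are illustrative assumptions, not an official mapping) that generates repetitive Gostash mapping statements from a template:

from jinja2 import Environment

# Use [[ ]] as the Jinja variable delimiters so the literal { } braces of the
# Gostash syntax pass through the template untouched.
env = Environment(variable_start_string="[[", variable_end_string="]]")
tmpl = env.from_string('mutate { replace => { "[[ udm ]]" => "%{[[ field ]]}" } }')

# Hypothetical token -> UDM field pairs, for illustration only.
mappings = [
    ("user.username", "event.idm.read_only_udm.principal.user.userid"),
    ("system.hostname", "event.idm.read_only_udm.principal.hostname"),
]
for field, udm in mappings:
    print(tmpl.render(field=field, udm=udm))
# mutate { replace => { "event.idm.read_only_udm.principal.user.userid" => "%{user.username}" } }
# mutate { replace => { "event.idm.read_only_udm.principal.hostname" => "%{system.hostname}" } }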
Tokenization
Tokenization is the process of capturing fields from the log message and assigning them to a new field name (the target token).
We will be using GenAI to generate a sample log message; this is the sample used throughout this guide:
{
  "timestamp": "2025-01-11T12:00:00Z",
  "event_type": "user_activity",
  "user": {
    "id": 12345,
    "username": "johndoe",
    "profile": {
      "email": "john.doe@example.com",
      "location": "New York",
      "VIP": true
    },
    "sessions": [
      {
        "session_id": "abc-123",
        "start_time": "2025-01-11T11:30:00Z",
        "actions": [
          {"action_type": "login", "timestamp": "2025-01-11T11:30:00Z", "targetIP": "10.0.0.10"},
          {"action_type": "search", "query": "weather", "timestamp": "2025-01-11T11:35:00Z", "targetIP": "10.0.0.10"},
          {"action_type": "logout", "timestamp": "2025-01-11T11:45:00Z", "targetIP": "10.0.0.11"}
        ]
      },
      {
        "session_id": "def-456",
        "start_time": "2025-01-11T12:00:00Z",
        "actions": [
          {"action_type": "login", "timestamp": "2025-01-11T12:00:00Z", "targetIP": "192.168.1.10"}
        ]
      }
    ]
  },
  "system": {
    "hostname": "server-001",
    "ip_address": "192.168.1.100"
  },
  "Tags": ["login_logs", "dev"]
}
It is very useful to view JSON logs as a tree structure. There are multiple tools available, including VSCode; in this guide I will be using the online viewer at https://jsonviewer.stack.hu/ . Make sure you use sanitized logs or internally approved JSON viewer tools.
(Figure: the JSON tree for the log sample above.)
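If an online viewer is not an option, a few lines of Python can produce a comparable tree offline. A minimal sketch, assuming the sample log is saved locally as sample_log.json (an assumed filename):

import json

def print_tree(node, indent=0):
    # Print one node per line, indented by depth; list elements show their indexes.
    pad = "  " * indent
    if isinstance(node, dict):
        for key, value in node.items():
            print(f"{pad}{key}")
            print_tree(value, indent + 1)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            print(f"{pad}[{i}]")
            print_tree(item, indent + 1)
    else:
        print(f"{pad}{node!r}")

with open("sample_log.json") as fh:
    print_tree(json.load(fh))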
Schema Identification
To parse logs correctly and efficiently, first define each field's type; schema inference tools streamline this, especially for large datasets. JSON logs are generally represented in Python as dictionaries whose keys are the field names. For the example above, we identify the types of several data fields:
Field Name | Data Type | Multiplicity | Hierarchy
event_type | Primitive - String | Single-valued | Flat
timestamp | Primitive - Date | Single-valued | Flat
system {} | Composite (object/dictionary) | Single-valued | Nested
user {} | Composite (object/dictionary) | Single-valued | Nested
user.profile.VIP | Primitive - Boolean | Single-valued | Flat
user.id | Primitive - Integer | Single-valued | Flat
Tags [] | Primitive - String - Repeated | Multi-valued (repeated, list) | Flat
user.sessions [] | Composite (dictionary) - Repeated | Multi-valued (repeated, list) | Nested
user.sessions[].actions[].action_type | Primitive - String | Single-valued (mandatory) | Flat
user.sessions[].actions[].query | Primitive - String | Single-valued (optional) | Flat

Notes:
- timestamp: Flat because the field sits at the topmost level of the JSON log, i.e. it is not nested under other fields. All date fields are strings that follow a particular date format; this one follows ISO 8601: 2025-01-11 is the date in YYYY-MM-DD format, T separates date and time, 12:00:00 is the time in HH:MM:SS (24-hour format), and Z is Zulu time (UTC, Coordinated Universal Time).
- system {}: Composite because its value holds two other fields, and these fields have a hierarchy, making the JSON a tree structure. The curly brackets next to the field name indicate a composite field. To access the composing nodes we use JSON dot notation, i.e. "system.hostname" and "system.ip_address" access the subfield nodes of "system".
- user {}: Composite because, like "system", it is made up of several fields underneath. Nested values are values that contain other JSON fields ({"id": ..., "username": ..., ...}) in a hierarchy; in Python this is analogous to a key whose value is a dictionary.
- user.profile.VIP: Boolean fields are strictly lowercase true or false without any quotes; they are not strings, as they have no enclosing double quotes "". This field is reached through its parents ("user", then "profile") but has no subfield nodes of its own, so it is Flat in terms of hierarchy. As you can see, accessing a nested hierarchy in JSON uses dot notation to express the parent/subfield relationship: "user.profile.VIP" is the path to "VIP".
- user.id: Integer fields are digits without double quotes and are handled differently from string fields, i.e. "12345" != 12345; the left is a string of digits while the right is an integer. The field is Flat because it has no subfield nodes, even though it has the parent node "user".
- Tags []: Flat because it contains no nested JSON fields; notice that the value of "Tags" is a list, not a dictionary (unlike "system"). Each element of the list is a string, so its data type is string but repeated, making this a Repeated-Atomic field. The square brackets "[ ]" next to the field name indicate a repeated rather than composite field. Repeated fields are accessed slightly differently than composite fields: the second tag is "tags[1]", not "tags.1".
- user.sessions []: The most complex field type; it has all the sub-types discussed. Repeated because it is composed of two session objects (0 and 1, both branching from the bracket next to the "sessions" field name), and each element in the list is a dictionary (object), so its data type is dictionary-repeated, or Repeated-Composite. The field is Nested because each element in the list has nested fields like "user.sessions[0].actions[1]".
- action_type: Mandatory fields appear in ALL of their parent's instances, i.e. every "actions" object has an "action_type".
- query: Optional fields appear in SOME of their parent's instances, i.e. not every "actions" object has a "query"; only the "search" actions will have it.
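A small script can approximate this classification automatically. A rough sketch in plain Python, operating on the sample saved as sample_log.json (an assumed filename); note it cannot distinguish date strings from plain strings:

import json

def classify(value):
    # Return (data type, multiplicity, hierarchy) for one JSON value.
    if isinstance(value, bool):   # check bool before int: True is an int in Python
        return ("Primitive - Boolean", "Single-valued", "Flat")
    if isinstance(value, int):
        return ("Primitive - Integer", "Single-valued", "Flat")
    if isinstance(value, dict):
        return ("Composite (dictionary)", "Single-valued", "Nested")
    if isinstance(value, list):
        composite = bool(value) and isinstance(value[0], dict)
        kind = "Composite - Repeated" if composite else "Primitive - Repeated"
        return (kind, "Multi-valued", "Nested" if composite else "Flat")
    return ("Primitive - String", "Single-valued", "Flat")

def walk(node, path=""):
    # Print every field's dot-notation path with its classification.
    for key, value in node.items():
        full = f"{path}.{key}" if path else key
        print(full, classify(value))
        if isinstance(value, dict):
            walk(value, full)
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            walk(value[0], full + "[0]")   # classify list elements by the first one

with open("sample_log.json") as fh:
    log = json.load(fh)
walk(log)

# Mandatory vs. optional: intersect the keys of every "actions" instance.
actions = [a for s in log["user"]["sessions"] for a in s["actions"]]
mandatory = set.intersection(*(set(a) for a in actions))
optional = set.union(*(set(a) for a in actions)) - mandatory
print("mandatory:", mandatory, "optional:", optional)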
Exporting Logs from Python Scripts
Tip: When exporting logs from Python scripts, always make sure to:
- Switch single quotes ' to double quotes ".
- Keep True/False lowercase (true/false).
- Remove None/null values, or convert them to the strings "None" or "Null".
Pasting logs with formatting errors (like invalid JSON or unbalanced brackets) into the parser UI will result in an error message. The simplest way to satisfy all of the points above is to serialize with json.dumps instead of printing Python objects, as shown in the sketch below.
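A minimal sketch of the difference, using only the Python standard library; the None-to-string conversion mirrors the tip above:

import json

record = {"user": "johndoe", "VIP": True, "query": None}

# str() gives the Python repr: single quotes, True/None -- not valid JSON.
print(str(record))          # {'user': 'johndoe', 'VIP': True, 'query': None}

# json.dumps() emits valid JSON: double quotes, lowercase true, null.
print(json.dumps(record))   # {"user": "johndoe", "VIP": true, "query": null}

# To follow the tip above and avoid null entirely, stringify None values first.
cleaned = {k: ("None" if v is None else v) for k, v in record.items()}
print(json.dumps(cleaned))  # {"user": "johndoe", "VIP": true, "query": "None"}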
If the logs are in the correct JSON format, they will render cleanly in the parser UI. (Figure: correctly formatted logs as shown in the UI.)
JSON Path Basics
This section covers the fundamentals of JSONPath, a query language for navigating and extracting data from JSON documents. Using a JSONPath evaluator, paste sample log messages and construct queries to select specific JSON fields. This will help you visualize JSON schemas and understand the distinctions between JSONPath and the Google Gostash parser syntax.
In Gostash, the first steps in any JSON parser are generally:
- Use the JSON parse clause, serializing array fields (more on that later; see the sketch after this list):
filter {
  json { source => "message" array_function => "split_columns" }
  statedump {}
}
- Gostash uses distinct syntaxes for field referencing, determined by the operator (e.g., assignment, conditional, loop, merge). This section will delve into the 'replace' operator, loops, and if-conditionals, given their prevalence; the 'merge' operator will be addressed in a dedicated section.
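For intuition only, here is a hypothetical Python illustration of what serializing array fields gives you: list elements become index-addressable children, which matches how %{Tags.1} is referenced later in this guide. This is a sketch of the idea, not Chronicle's actual implementation:

def split_columns(node):
    # Turn every list into a dict keyed by element index, recursively.
    if isinstance(node, list):
        return {str(i): split_columns(v) for i, v in enumerate(node)}
    if isinstance(node, dict):
        return {k: split_columns(v) for k, v in node.items()}
    return node

print(split_columns({"Tags": ["login_logs", "dev"]}))
# {'Tags': {'0': 'login_logs', '1': 'dev'}}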
There are lots of JSON Path evaluators, we will be using https://jsonpath.com/ developed by “Hamasaki Kazuki” in this guide.
Select the Whole JSON
JSONPath: $

Note: JSONPath returns a list by default, i.e. querying a log message with $ returns [ ...<log message content>... ], but for the sake of simplicity we will assume it returns just the log message (without the list) when comparing Gostash and JSONPath.
For example, selecting a whole JSON log such as {"user": "Alex"} with "$" actually returns [{"user": "Alex"}], not {"user": "Alex"}; for simplicity we will ignore this in the following sections.
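You can verify this list-wrapping behaviour programmatically. A minimal sketch, assuming the third-party jsonpath-ng package (pip install jsonpath-ng):

from jsonpath_ng import parse

log = {"user": "Alex"}

matches = [m.value for m in parse("$").find(log)]
print(matches)      # [{'user': 'Alex'}]  <- a list wrapping the log
print(matches[0])   # {'user': 'Alex'}   <- the log itself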
Select a Simple Field
JSONPath: $.event_type

The first operator $ selects the root JSON log, the dot operator "." moves the selection one level down to the subfield nodes, and "event_type" picks the subfield node "event_type". In effect, this selects the value of the event_type field, which is "user_activity". In Gostash, to reference the same field:
filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "myVariable" => "%{event_type}" }} }
filter { json { source => "message" array_function => "split_columns"} if [event_type] == "abc" { } }
Select a Subfield
JSONPath: $.user.id

Same as above: $ selects the root JSON log, "." moves one level down to the subfield nodes, and "user" picks the subfield node "user"; we then move one step further and reference the second-level field "id". In Gostash, to reference the same field:
I. For non-string fields: Not supported. For example, user.id is an integer, not a string, so
filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "myVariable" => "%{user.id}" }} }
will give an error, as "user.id" is an integer field, not a string field, as we highlighted earlier.
II. For string fields: Supported. E.g. "user.username" is a string field, referenced by %{user.username}:
filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "myVariable" => "%{user.username}" }} }
In conditionals:
filter { json { source => "message" array_function => "split_columns"} if [user][id] == 13 { } }
IF conditionals in Gostash use bracket notation instead of dot notation to reference nested fields, and they support both string and non-string data types.
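For intuition, the same references in plain Python (on a trimmed version of the sample log) make the string-versus-integer distinction obvious:

import json

raw_message = '{"user": {"id": 12345, "username": "johndoe"}}'  # trimmed sample
log = json.loads(raw_message)

print(log["user"]["username"])   # 'johndoe' -- a string, like %{user.username}
print(log["user"]["id"])         # 12345 -- an integer, so %{user.id} fails

# The Python analogue of the workaround is an explicit string conversion.
my_variable = str(log["user"]["id"])   # "12345"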
Select a Repeated Field
JSONPath: $.Tags

The syntax is similar to the above cases, but selecting a single element requires appending a bracketed index: to access the 2nd tag, the JSONPath is $.Tags[1]. In Gostash, to reference the 2nd tag:
Wildcards: Not supported. In Gostash, you cannot use wildcard syntax (like %{Tags.*}) to access all items in a repeated field; you must reference specific elements by their index, such as the first or second tag.
filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "myVariable" => "%{Tags.*}" }} statedump {} }
Indexed references: Supported, e.g. "%{Tags.1}";
filter { json { source => "message" array_function => "split_columns"} mutate { replace => { "myVariable" => "%{Tags.1}" }} statedump {} }
Without "array_function", repeated fields won't be accessible, i.e. %{Tags.1} is not possible if the JSON is parsed without it; "array_function" should always be used to allow accessing repeated fields.
filter { json { source => "message"} mutate { replace => { "myVariable" => "%{Tags.1}" }} statedump {} }
Conditionals: Supported with array_function;
filter { json { source => "message" array_function => "split_columns"} if [Tags][1] == "dev" { statedump {}} }
Without array_function: not supported and will give an error;
filter { json { source => "message" } if [Tags][1] == "dev" { statedump {}} }
Loops:
I. With array_function: Supported;
filter { json { source => "message" array_function => "split_columns"} for index, _tag in Tags { statedump {}} }
The loop will execute two times, corresponding to the two values in the 'Tags' field. Pay attention to the use of 'Tags' without the '%{}' syntax; we'll cover the reason for this in a later section. Gostash's syntax might allow you to write a loop referencing a specific index of a repeated field (e.g., Tags.0), but this will result in a logical error, and the loop's content will be skipped.
filter { json { source => "message" array_function => "split_columns"} for index, _tag in Tags.0 { statedump {}} }
II. Without array_function: the loop won't produce a syntax error, but it will fail to execute. This is a logical error, meaning the loop is structurally incorrect.
filter { json { source => "message" } for index, _tag in Tags { statedump {}} }
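The loop semantics map directly onto plain Python; a quick sketch using the Tags value from the sample log:

tags = ["login_logs", "dev"]   # the "Tags" value from the sample log

# Mirrors: for index, _tag in Tags { ... } -- the body runs once per element.
for index, tag in enumerate(tags):
    print(index, tag)          # 0 login_logs, then 1 dev

# Looping over a single element (the Tags.0 case) gives nothing to iterate:
# a scalar like "login_logs" is not a collection, hence the logical error.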
Select a Composite Field
JSONPath: $.system or $.system.*

This example is a particular distinction between JSONPath and Gostash. In JSONPath you can select the composite field the same way as a simple field using $.system, but it will generate a list of objects:
[ { "hostname": "server-001", "ip_address": "192.168.1.100" } ]
Appending .* instead selects the values of the subfields:
[ "server-001", "192.168.1.100" ]
In Gostash, there is no direct equivalent to $.system, but you can reference individual subfields, test them in conditionals, and loop over each subfield:
filter { json { source => "message"} mutate { replace => { "myVariable" => "%{system.hostname}" }} statedump {} }
filter { json { source => "message"} if [system][hostname] == "server-001" { statedump {} } }
filter { json { source => "message"} for index, systemDetail in system map { statedump {}} }
Note the "map" keyword: this is a distinction from looping through repeated fields, which does not require it. This will be discussed later in more detail.
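Again in plain Python, the composite-field loop corresponds to iterating a dictionary rather than a list, which is the intuition behind the extra "map" keyword:

system = {"hostname": "server-001", "ip_address": "192.168.1.100"}

# Mirrors: for index, systemDetail in system map { ... }
for key, value in system.items():
    print(key, value)            # hostname server-001 / ip_address 192.168.1.100

# JSONPath $.system.* corresponds to just the values:
print(list(system.values()))     # ['server-001', '192.168.1.100']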