Author: Kyle Martinez
In any Security Operations Center (SOC), a critical question must be answered when deploying SOAR: how do we know that the critical pieces of SecOps are working?
Without visibility into the performance and health of your automation workflows, your organization risks delayed response times and failed mitigations. In this guide, we’ll explore how to monitor for performance issues and errors across the key components of the platform, including:
- Playbook performance
- Python/playbook errors
- Remote agent errors
- ETL processes
Monitoring Playbook Performance with Dashboards
To gain high-level, aggregate insights into your automation, we’ll start by looking at playbook metrics in Dashboards. Before we review the dashboard, let’s quickly cover how to access Dashboards and the required permissions.
Accessing SOAR Dashboards
- Log in to your Google SecOps portal.
- Open the Navigation Menu: Hover over the sliding left navigation bar to expand the menu.
- Navigate to Dashboards: Select Dashboards & Reports > Dashboards.
- Select a curated or custom dashboard.
Required Permissions
Access to these dashboards is governed by Identity and Access Management (IAM). To view SOAR data sources, you must be a global user. Key permissions include:
- chronicle.nativeDashboards.list
- chronicle.nativeDashboards.get
- chronicle.nativeDashboards.create
- chronicle.nativeDashboards.duplicate
- chronicle.nativeDashboards.update
- chronicle.nativeDashboards.delete
Playbook Dashboard (SOAR)
Google SecOps offers a curated dashboard named “Playbook Dashboard (SOAR)”. This dashboard is a vital tool, surfacing information about playbook performance, distribution, and errors.

Key Performance Indicators (KPIs)
The following charts available on the Playbook Dashboard will help us rapidly identify potential issues and bottlenecks with our playbooks:
- Average Runtime in Minutes: Monitors efficiency by providing the average runtime across all playbooks.
- % Error Runs: Provides the overall percentage of playbook runs that resulted in an error.
- % Errored Playbook Runs Per Playbook: Identifies specific playbooks that are failing most often.
- Average Runtime per Playbook: Provides the average runtime for each playbook.
- Playbook Queue: Lists the number of playbooks that are queued.
- % Faulted Actions: Displays a list of actions that have failed within a playbook.
In addition to using curated dashboards, you can create custom visualizations using the Playbook data set. This data set provides deep visibility into the execution lifecycle of your automation, allowing you to track exactly how playbooks are performing over a 365-day retention period.
Deep Dive into SOAR Logging with Logs Explorer
When an issue is identified in the dashboard, the next step is to investigate the raw logs for granular debugging. SOAR logs capture ETL processes, playbook executions, and python functions. These logs are extremely helpful in proactively monitoring the operational health of your SOAR and provide the insight necessary to debug playbook and integration failures.
The primary tool for viewing and analyzing these logs is the Google Cloud Logs Explorer.
Accessing Google Cloud Operations Tools
Both tools are located in the Google Cloud Console:
- Logs Explorer: In the navigation panel, go to Logging > Logs Explorer.
- Cloud Monitoring: In the navigation panel, go to Monitoring > Alerting (for alerts) or Dashboards.
Required IAM Roles
Access to these logs and metrics is governed by Google Cloud IAM. Ensure you or your service account has the following roles assigned at the project level:
- Logs Explorer
  - roles/logging.viewer
  - roles/logging.privateLogViewer
  - roles/logging.admin
- Cloud Monitoring
  - roles/monitoring.viewer
  - roles/monitoring.alertPolicyEditor
Setting Up SOAR Log Collection
This process differs depending on whether you have Google SecOps or the standalone Google SecOps SOAR platform. You can find the appropriate documentation to walk through the initial setup:
- Google SecOps: https://docs.cloud.google.com/chronicle/docs/secops/collect-secops-soar-logs
- Standalone Google SecOps SOAR: https://docs.cloud.google.com/chronicle/docs/soar/investigate/collecting-soar-logs
SOAR Log Organization
To start, let’s see how SOAR logs are organized and filtered in Logs Explorer.
To see all logs originating from SOAR, you can use the following fundamental query, which targets the primary namespace label:
resource.labels.namespace_name="chronicle-soar"
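
For quick triage at this level, the same namespace filter can be narrowed to error-severity entries before drilling into individual services. This is a minimal variation of the query above, using the severity filter shown later in this guide:
resource.labels.namespace_name="chronicle-soar"
severity="ERROR"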

For more granular analysis, you can query relevant service containers, which will have their own set of labels for further filtering.
Filtering by Container and Labels
Python Service
The python service handles connectors, jobs, and actions. The logs generated here are critical for debugging integration failures.
| Component | Key Labels for Filtering |
| --- | --- |
| Integrations & Connectors | integration_name, integration_version, connector_name, connector_instance |
| Jobs | integration_name, integration_version, job_name |
| Actions | integration_name, integration_version, integration_instance, correlation_id, action_name |
Example: Viewing logs from a specific connector
To see all logs from any instance of a Google Chronicle Alerts Connector, combine the namespace, container, and connector labels:
resource.labels.namespace_name="chronicle-soar"
resource.labels.container_name="python"
labels.connector_name="Google Chronicle - Chronicle Alerts Connector"
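
Example: Viewing logs from a specific job
Jobs can be filtered the same way using the job_name label from the table above. The job name below is a placeholder; substitute the name of a job configured in your environment:
resource.labels.namespace_name="chronicle-soar"
resource.labels.container_name="python"
labels.job_name="<your job name>"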

Playbook Service
To isolate logs related to the flow logic and errors within your automation workflows, target the playbook container.
Available Playbook Labels:
- playbook_name
- playbook_definition
- block_name
- block_definition
- case_id
- correlation_id
- integration_name
- action_name
Example: Viewing logs for a specific playbook
resource.labels.namespace_name="chronicle-soar"
resource.labels.container_name="playbook"
labels.playbook_name="Google Chronicle - Windows Threats Investigation and Response Playbook"
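
Example: Viewing playbook logs for a specific case
Because the playbook container also exposes a case_id label, you can scope playbook logs to a single case. The case ID below is a placeholder; substitute one from your own environment:
resource.labels.namespace_name="chronicle-soar"
resource.labels.container_name="playbook"
labels.case_id="<your case ID>"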

ETL Service
The ETL service handles data ingestion and processing. It’s essential to monitor this to ensure data is flowing correctly into your SOAR environment.
Available ETL Labels:
- correlation_id
Example: General ETL query
resource.labels.namespace_name="chronicle-soar"
resource.labels.container_name="etl"

Use Cases
One of the most critical monitoring tasks is identifying and resolving playbook failures. A silent failure means missed threat coverage, so a targeted search for errors is essential.
Specific Playbook Errors
To investigate a failure within a particular automated workflow, filter for all log entries generated by the playbook container that have a severity level of "ERROR", specifically targeting the playbook by name. This quickly narrows down the scope of the investigation.
resource.labels.namespace_name="chronicle-soar"
resource.labels.container_name="playbook"
labels.playbook_name="Google Chronicle - Windows Threats Investigation and Response Playbook"
severity="ERROR"

Pinpointing Action and Python Failures
Actions executed by your playbooks run within the python service container, which handles connectors, jobs, and actions. It is crucial to monitor this service for connector and integration issues. A simple severity filter might not suffice, as an action can "fail" gracefully but still return a problematic result, which is why we also inspect the textPayload.
Failed Actions (General)
This query is powerful because it searches the raw log content (textPayload) for the specific string "Result Value: False". This directly identifies instances where an action completed but returned an unsuccessful outcome, regardless of the log's set severity level.
resource.labels.namespace_name="chronicle-soar"
resource.labels.container_name="python"
textPayload=~"Result Value: False"

Internal Server Errors
Quickly identify systemic service issues by searching for critical application errors. These often point to configuration or resource problems within the environment.
resource.labels.namespace_name="chronicle-soar"
resource.labels.container_name="python"
textPayload=~"Internal Server Error"

Tracking System Processes (ETL & Correlation)
Core system tasks, such as Extract, Transform, Load (ETL) processes, are fundamental to data flow. Monitoring these for errors ensures that your ingested data is being processed correctly.
ETL Failures
Filter on the “etl” container and look for any log entry with a severity of "ERROR". Consistent ETL errors can indicate issues with data sources or transformation scripts. The only available label for the ETL container is correlation_id.
resource.labels.namespace_name="chronicle-soar"
resource.labels.container_name="etl"
severity="ERROR"

Correlation ID Tracking
Every significant action or process run has a unique correlation_id. Using this label allows you to pull all related logs (from playbook, python, and etl containers) for a single comprehensive view of a specific event's lifecycle.
resource.labels.namespace_name="chronicle-soar"
labels.correlation_id="e4a0b1f4afeb43e5ab89dafb5c815fa7"
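
To see only the failures within that chain, the same correlation filter can be combined with a severity filter:
resource.labels.namespace_name="chronicle-soar"
labels.correlation_id="e4a0b1f4afeb43e5ab89dafb5c815fa7"
severity="ERROR"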

Filtering by Timestamp
When investigating an incident that occurred at a known time, accurate temporal filtering is essential. You can use RFC 3339 or ISO 8601 formats for precise searches.
Logs Newer Than (Specific Time)
Use this to view all activity occurring after a known event, such as a deployment or a reported error.
timestamp>="2023-12-02T21:28:23.045Z"
Logs for a Specific Day Range
This inclusive/exclusive filter retrieves all logs within a defined period. For example, the query below would capture all logs from December 1st and December 2nd.
timestamp>="2023-12-01" AND timestamp<"2023-12-03"
Proactive Monitoring (Beyond Ad-Hoc Search)
While manual searches are great for debugging, true operational health requires automation. Consider setting up:
- Log-based alerts: Define alerts that trigger whenever one of the critical error queries (like textPayload=~"Internal Server Error") returns a log entry.
- Metric-based alerts: Create alerts based on thresholds and rates, such as a sustained increase in error-severity log volume.
Let’s work through an example of a log-based alert where we want to know if a critical playbook experiences failures.
First, we’ll start with our example query from earlier in the doc. We can then click the ‘Actions’ button, then ‘Create log alert’ to begin creating our alert.
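For reference, this walkthrough assumes the specific playbook error filter from the Use Cases section above:
resource.labels.namespace_name="chronicle-soar"
resource.labels.container_name="playbook"
labels.playbook_name="Google Chronicle - Windows Threats Investigation and Response Playbook"
severity="ERROR"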

The first section covers the alert details: the policy name, an appropriate severity level, and documentation. The documentation field is crucial for providing context about the alert and how it relates to the log data.

In the next section, you designate which logs are relevant to the alert. The log query will already be populated, and you can optionally extract specific log labels.

Finally, determine the notification frequency and specify the recipients. While this example focuses on sending an email notification, a variety of notification channels are available, such as Google Chat, Webhooks, and Slack. Further details on notification channels are accessible here.

The example alert below illustrates both the extracted labels and the essential policy documentation, offering crucial contextual insight regarding the alert and its connection to the underlying log data.

Conclusion: Ensuring Operational Readiness
Monitoring the operational health of your SOAR platform is not a one-time task; it is an ongoing necessity for maintaining the integrity and effectiveness of your security automation. By leveraging the Playbook Dashboards for high-level performance metrics and mastering the Logs Explorer for granular debugging and alerting, you are equipped to proactively identify, diagnose, and resolve issues related to:
- Playbook performance
- Python/playbook errors
- Remote agent errors
- ETL processes
Implementing these monitoring practices will ensure that your automated processes are consistently working as intended, maximizing your security team’s efficiency and minimizing response latency.
