
Time Travel in the SOC: Troubleshooting Late-Triggering YARA-L Rules

May 6, 2026

Author: David Nehoda, Technical Solutions Consultant

 

The 11:47 Ticket

At 11:47 on a Tuesday, a detection engineer opens a P1. A YARA-L rule for AWS root console logins fired 14 hours after the event. The root account had logged in at 21:32 the previous night, performed IAM changes, and logged out. The rule was correct. The logic was tight. The tuning was sound. But the alert landed the next morning, long after the window to contain the session had closed.

The team spends two days on the wrong argument. The SIEM vendor gets blamed. The rule gets rewritten three times. An engineer drafts a migration plan to a different detection platform. None of that is the problem.

The problem is two timestamps.

metadata.event_timestamp on that UDM event is 2025-10-13T21:32:04Z. metadata.ingested_timestamp is 2025-10-14T11:29:51Z. Delta: 13 hours, 57 minutes. That rules out the YARA-L engine entirely. The event arrived 14 hours late. What followed was a 60-second check of the AWS CloudTrail feed polling interval. Someone had set it to 12 hours during a rate-limit test six months earlier and never changed it back. Two clicks to restore the 5-minute interval. Rule fires within SLA the next time. Two days of engineering argument wasted.

This article gives you the deterministic way to rule out the wrong layer fast.
 

Executive Summary

 

| Dimension | Undiagnosed Delay | Systematic Timestamp Audit |
| --- | --- | --- |
| Detection Latency | 6 to 24 hours (silent) | Under 5 minutes (near real-time) |
| Root Cause Isolation | Days to weeks of finger-pointing | Under 60 seconds via timestamp comparison |
| Engineering Waste | Weeks rewriting correct rules | Zero; fix targets the actual bottleneck |
| Attacker Dwell | Extends linearly with delay | Bounded by detection plus response time |
| Breach Exposure | $4.5M+ average with extended dwell | Contained by rapid detection |

 

Who this is for: Detection engineers, SOC analysts, and security architects who see YARA-L rules fire late in Google SecOps and need to isolate whether the problem is the log source, the ingestion pipeline, or the rule logic itself.
 

The Three Places Delay Lives

A late-triggering rule is a symptom, not a disease. The delay lives in exactly one of three layers:

  1. Origin Delay. The vendor or log source did not generate or deliver the logs in time.

  2. Ingestion Delay. The forwarder, feed, or parser pipeline held the log before it reached UDM.

  3. Evaluation Delay. The YARA-L rule engine itself is slow due to misconfigured match windows, expensive regex, or state explosion.

The diagnostic process is deterministic. Compare two timestamps, and the layer reveals itself.
 

Vendor Delivery Latency: The Table You Should Memorize

Before you debug anything, internalize what "on time" actually means. No SIEM configuration can make a log arrive faster than the source will send it.
 

| Vendor / Source | Typical Delivery Latency | Worst Case |
| --- | --- | --- |
| AWS CloudTrail (S3 polling) | 5 to 15 minutes | 30+ minutes during AWS service events |
| AWS CloudTrail (EventBridge) | Under 1 minute | 2 to 3 minutes |
| O365 Management Activity API | 5 to 30 minutes | Hours during Microsoft outages |
| Azure AD Sign-in Logs | 2 to 15 minutes | 30+ minutes |
| Azure Event Hub streaming | 2 to 5 minutes | 10 minutes |
| Okta System Log | 1 to 5 minutes | 15 minutes |
| GCP Cloud Audit Logs (Pub/Sub) | 1 to 3 minutes | 10 minutes |
| CrowdStrike Streaming API | Under 1 minute | 5 minutes |
| Duo Admin API | 2 to 10 minutes | 30 minutes |
| Salesforce EventLogFile | Up to 24 hours (hourly tier) | 24 hours (daily tier) |

 

Read this carefully. If your YARA-L rule fires 15 minutes after an AWS CloudTrail event, that is inside AWS's own SLA. The SIEM is working. AWS took 15 minutes to write the log into the S3 bucket the feed polls. Rewriting the rule will not shrink that gap. Switching to EventBridge will.
 

Step 1: The Timestamp Interrogation

Every UDM event carries two timestamps. Comparing them isolates the delay in under a minute.
 

| Timestamp | Set By | Meaning |
| --- | --- | --- |
| metadata.event_timestamp | The log source | When the event physically occurred on the endpoint, server, or cloud API |
| metadata.ingested_timestamp | SecOps | When SecOps received, parsed, and indexed the log into UDM |

 

Running the Comparison

  1. Open the SOAR case for the late-firing alert.

  2. Navigate to the underlying UDM event that triggered the detection.

  3. Inspect the raw JSON and extract both timestamps.

  4. Calculate ingested_timestamp - event_timestamp.
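
Worked on the opening incident's timestamps, the arithmetic is trivial; here is a minimal sketch in Python (any calculator works just as well):

from datetime import datetime

# Both UDM timestamps are RFC 3339 strings in UTC
def parse_ts(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

event = parse_ts("2025-10-13T21:32:04Z")     # metadata.event_timestamp
ingested = parse_ts("2025-10-14T11:29:51Z")  # metadata.ingested_timestamp

print(ingested - event)  # 13:57:47 -> the log arrived ~14 hours late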

Interpreting the Delta

 

| Delta | Diagnosis | Meaning |
| --- | --- | --- |
| Minutes to hours | Origin or Ingestion Delay | The log arrived late. The YARA-L engine evaluated it promptly. The rule is innocent; the problem is upstream. |
| Seconds or near-zero | Evaluation Delay | The log arrived on time, but the YARA-L rule took hours to fire. The engine is the bottleneck: match window, regex, or queue pressure. |
| Negative (event > ingested) | Clock Skew | The log source's system clock is ahead of UTC. Fix NTP on the source; temporal correlation cannot be trusted until you do. |

 

Most of the time, the delta is large. Most of the time, the rule is innocent. Believe the timestamps.

 

Step 2: Fixing Upstream Delays

If the interrogation proves the log arrived late, debug the pipeline, not the rule.

2A. Feed Polling Interval

For cloud-to-cloud API feeds (Okta, Azure AD, O365, AWS via S3), SecOps polls on a configured interval.

Diagnostic: Navigate to Settings > Feeds and inspect the polling interval for the affected log type.
 

| Interval | Effect | When to Use |
| --- | --- | --- |
| 1 minute | Near real-time for API feeds | Critical identity telemetry (Okta, Azure AD) |
| 5 minutes | Standard for most cloud feeds | Default for non-critical sources |
| 15 minutes | Cost-optimized for high-volume, low-priority sources | Network flow logs, verbose audit logs |
| 12 hours | Almost certainly a misconfiguration | Never appropriate for security telemetry |

 

The most common mistake in the field: an engineer sets the interval to 12 hours during initial testing to avoid API rate limits, forgets to change it back, and the SOC operates with a 12-hour blind spot for months. The opening incident of this article is exactly that mistake.

2B. Switch from Polling to Streaming

For critical detections, stop polling. Subscribe.

 

| Cloud | Polling Path | Streaming Path | Latency Improvement |
| --- | --- | --- | --- |
| AWS | CloudTrail > S3 > SecOps poll | CloudTrail > EventBridge > Lambda > Chronicle Ingestion API | 15 min to under 1 min |
| Azure | Activity Log > Storage account poll | Event Hub > Chronicle Forwarder | 15 min to 2-5 min |
| GCP | Cloud Audit > Storage poll | Cloud Audit > Pub/Sub > Chronicle | Already sub-minute; skip polling entirely |
| O365 | Management Activity API poll | Event Hub via Microsoft Graph connector | 30 min to 2-5 min |

 

The economics usually favor streaming. The engineering cost of one EventBridge rule and a Lambda is recovered the first time a real incident is detected fast enough to contain.
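
For the AWS row, a minimal sketch of the Lambda half of that path is below. The Ingestion API endpoint, payload shape, and token handling are assumptions to verify against your tenant's ingestion documentation; production code would mint OAuth credentials from the ingestion service account rather than read a static token from the environment.

import json
import os
import urllib.request

# Assumed endpoint and payload shape for the Chronicle Ingestion API
INGESTION_URL = "https://malachiteingestion-pa.googleapis.com/v2/unstructuredlogentries:batchCreate"

def handler(event, context):
    """EventBridge delivers the CloudTrail record in event["detail"]."""
    body = json.dumps({
        "customer_id": os.environ["CHRONICLE_CUSTOMER_ID"],
        "log_type": "AWS_CLOUDTRAIL",
        "entries": [{"log_text": json.dumps(event["detail"])}],
    }).encode()
    req = urllib.request.Request(
        INGESTION_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['CHRONICLE_TOKEN']}",  # assumed auth wiring
        },
    )
    with urllib.request.urlopen(req) as resp:
        return {"status": resp.status}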
 

2C. Overloaded Forwarder

If you run an on-prem Chronicle Forwarder, the host's resources directly impact ingestion latency.

Diagnostic signs:

  • Sustained CPU over 80%

  • Memory pressure causing swap usage

  • Disk I/O saturation on the syslog receiving buffer

  • Network saturation between forwarder and ingestion endpoint

Fix: Allocate more resources, or split the workload across multiple forwarders by log type. High-volume sources (firewall netflow, DNS queries) should not share a forwarder with low-volume, high-priority sources (identity logs, EDR alerts). One noisy neighbor can starve a critical feed.
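
As an illustration of the split, a forwarder instance dedicated to identity telemetry might carry a config like the sketch below. The YAML follows the general shape of the documented forwarder config (an output block plus a collectors list), but the identity values, port, and log type are placeholders; treat it as a sketch, not a drop-in file.

output:
  url: malachiteingestion-pa.googleapis.com:443
  identity:
    collector_id: <collector-id>      # placeholder
    customer_id: <customer-id>        # placeholder
    secret_key: |
      <service-account-json>          # placeholder

collectors:
  # Identity-only forwarder: low volume, high priority. Run firewall netflow
  # and DNS on a separate forwarder host so they cannot starve this feed.
  - syslog:
      common:
        enabled: true
        data_type: WINEVTLOG          # illustrative identity log type
        batch_n_seconds: 10
        batch_n_bytes: 1048576
      tcp_address: 0.0.0.0:10514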
 

Step 3: Fixing YARA-L Evaluation Delays

If logs arrived on time but the alert fired late, the rule engine is the bottleneck. Three common causes.

3A. The Sliding Window Trap

A rule with match: $target_ip over 24h forces the engine to maintain running state for every unique $target_ip across 24 hours. With millions of unique IPs, state management consumes massive memory and CPU, delaying evaluation for every rule on the tenant, not just the offending one.
 

Shrink match windows to the minimum viable timeframe for the actual attack pattern:
 

| Attack Pattern | Appropriate Window | Why |
| --- | --- | --- |
| Brute force to success | over 1h | The attack completes in minutes, not hours |
| Impossible travel | over 4h | Generous for international travel + VPN lag |
| Low-and-slow data exfiltration | over 24h | Justified; the attack is deliberately slow |
| Malware drop to execution | over 10m | Execution follows drop within seconds |
| Kerberoasting spray | over 1h | Spraying completes in minutes |

 

Rule of thumb: If the attack completes in minutes, the match window should be minutes. A 24-hour window on a brute force rule wastes engine resources for zero detection benefit.
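
In the rule itself, the fix is usually one line in the match section. A sketch for the brute-force row, with $user standing in for whatever match variable the rule declares:

// Before: 24 hours of state per unique user, for an attack that completes in minutes
match:
    $user over 24h

// After: window sized to the actual attack tempo
match:
    $user over 1h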

3B. Unanchored Regex (ReDoS)

Complex, unanchored regex applied to the full UDM dataset consumes exponential compute. The engine can throttle or disable the rule entirely.
 

// BAD: regex scans every single event in UDM
re.regex($e.target.process.command_line, `(?i).*invoke-mimikatz.*`)

// GOOD: filter narrows the dataset, regex evaluates a tiny subset
$e.metadata.event_type = "PROCESS_LAUNCH"
re.regex($e.target.process.command_line, `(?i).*invoke-mimikatz.*`)


Without the event_type filter, the regex runs against every UDM event including network connections, DNS queries, email transactions, and file creations, none of which will ever contain "invoke-mimikatz" but all of which must still be evaluated. The filter drops the evaluation surface by 95 percent or more.
 

Additional regex discipline:

  • Anchor patterns when possible. (?i)\\powershell\.exe$ is faster than (?i).*powershell.*.

  • Avoid nested quantifiers. (a+)+ causes catastrophic backtracking.

  • Use exact string operations when an exact match is sufficient. $e.target.process.file.full_path = "/usr/bin/curl" beats any regex.

Catch this before it ships. The YaraL Validator flags unanchored regex and missing event-type filters at commit time, not after a rule goes live and tanks tenant performance. Put it in CI.
 

3C. Tumbling Windows

YARA-L 2.0 Tumbling Windows segment data into fixed, non-overlapping intervals for deduplication. Unlike sliding windows (continuously evaluated), tumbling windows evaluate once at the end of each interval.
 

The trap: with a 1-hour tumbling window, an event arriving at 10:01 does not trigger an alert until 11:00 when the window closes.
 

| Use Tumbling For | Use Sliding For |
| --- | --- |
| "Alert once per hour if >100 failed logins occur" | "Alert the moment the 10th failed login within 10 minutes arrives" |
| Aggregate statistics, deduplication | Real-time attack chain detection |
| Volumetric alerts with defined cadence | Any detection where time-to-fire matters |

 

Default to sliding. Reach for tumbling only when batched aggregation is the actual requirement.
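
A minimal sketch of the failed-login case from the table. The event type and field values are illustrative, and whether your tenant's engine wants the explicit before/after sliding-window form in the match section should be checked against the YARA-L reference:

rule repeated_login_failure {
  meta:
    author = "detection-engineering"
    severity = "HIGH"

  events:
    $e.metadata.event_type = "USER_LOGIN"
    $e.security_result.action = "BLOCK"
    $e.target.user.userid = $user

  match:
    // Short window sized to the attack tempo, so the rule can fire
    // when the 10th failure lands, not at the top of the hour
    $user over 10m

  condition:
    #e >= 10
}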
 

Proactive Monitoring: Stop Waiting for Analysts to Complain

Do not wait for a detection engineer to open a ticket. Build automated health monitoring.

Ingestion Latency Dashboard

Schedule a UDM query that calculates the average delta between event_timestamp and ingested_timestamp per log type. Alert the Detection Engineering channel if any log type's average exceeds your threshold (typical: 15 minutes).
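
A sketch of the calculation, assuming you can export matching events (search API, BigQuery export) as dicts carrying the two metadata fields; the field paths mirror UDM, everything else is illustrative:

from collections import defaultdict
from datetime import datetime

THRESHOLD_SECONDS = 15 * 60  # alert when a log type's average exceeds 15 minutes

def parse_ts(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def latency_breaches(events):
    """events: iterable of dicts with log_type, event_timestamp, ingested_timestamp."""
    sums, counts = defaultdict(float), defaultdict(int)
    for e in events:
        delta = parse_ts(e["ingested_timestamp"]) - parse_ts(e["event_timestamp"])
        sums[e["log_type"]] += delta.total_seconds()
        counts[e["log_type"]] += 1
    return {lt: sums[lt] / counts[lt] for lt in sums
            if sums[lt] / counts[lt] > THRESHOLD_SECONDS}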

Feed Health Check

Use the SecOps v1alpha/feeds API to programmatically verify each feed last polled within its expected interval. If a feed has not polled within 2x its configured interval, it is almost certainly broken. Page someone.
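
A hedged sketch of that check; the endpoint path and the two feed fields are assumptions based on the API family named above, so verify the exact schema against the feeds API reference before paging anyone:

import json
import os
import urllib.request
from datetime import datetime, timezone

BASE = "https://chronicle.googleapis.com/v1alpha"  # regional endpoints vary
PARENT = os.environ["SECOPS_INSTANCE"]  # projects/.../locations/.../instances/...

def list_feeds():
    req = urllib.request.Request(
        f"{BASE}/{PARENT}/feeds",
        headers={"Authorization": f"Bearer {os.environ['SECOPS_TOKEN']}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("feeds", [])

def stale_feeds(feeds):
    """Flag feeds whose last poll is older than 2x the configured interval."""
    now = datetime.now(timezone.utc)
    stale = []
    for f in feeds:
        interval = int(f.get("pollIntervalSeconds", 300))                        # assumed field
        last = datetime.fromisoformat(f["lastPollTime"].replace("Z", "+00:00"))  # assumed field
        if (now - last).total_seconds() > 2 * interval:
            stale.append(f["name"])
    return stale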

Rule Evaluation Monitor

Track the delta between ingested_timestamp and detection_timestamp. If a specific rule consistently exceeds 5 minutes for this delta, that rule needs performance optimization. Open a ticket automatically.
 

Production Deploy Checklist

Before calling a late-trigger incident closed:

  1. Timestamp interrogation run. Delta calculated. Root cause layer identified.

  2. If Origin/Ingestion: polling interval verified under 5 minutes for all identity and critical-auth feeds.

  3. If AWS is involved: EventBridge path considered for critical detections.

  4. If Evaluation: match window audited against actual attack tempo. Shrunk where justified.

  5. If Evaluation: every regex in the offending rule has an event_type or equivalent pre-filter.

  6. Ingestion latency dashboard live, alerting on 15-minute threshold breach per log type.

  7. Feed health API check scheduled, paging when a feed misses 2x its interval.

  8. Rule evaluation monitor live for all CRITICAL-severity detections.

  9. Runbook updated: "Every late alert ticket starts with the timestamp interrogation. Not with a rule rewrite."

  10. Post-incident: if a feed misconfiguration was the cause, infrastructure-as-code the feed definition so a human cannot set 12-hour polling by hand next time.
     

The Diagnostic Truth

Late-triggering rules reduce to a single delta with three outcomes.

  • Large delta: upstream delay. Fix the pipeline.

  • Zero delta: evaluation delay. Fix the rule.

  • Negative delta: clock skew. Fix NTP.

Every time a rule fires late, run the interrogation first. Most of the time, the engine is innocent and the delay lives in a vendor's API, a misconfigured feed, or an under-resourced forwarder. The rewrites and vendor-blame sessions you avoid are the real ROI.