Regex

December 27, 2024

anurag.q.singh
Bronze 2

I am writing a YARA-L rule and I am stuck on a problem.

I want to write a regex that captures the domain part of a URL.

For example:

For https://www.example.com/path, it captures example.com
For http://example.org, it captures example.org
For subdomain.example.co.uk/path, it captures subdomain.example.co.uk

My regex for this is:

$domain = re.capture($e.any_variable, "(?:https?:\/\/)?(?:www\.)?([^\/\s]+)")

The error I am getting is:
tokenizing: unable to tokenize: invalid char escape

Can somebody help me with this?

@AymanC @jstoner

AbdElHafez
Staff
December 27, 2024

Could you try:

(?:https?://)?(?:www\\.)?([^/\\s]+)

You do not need to escape "/". To use a regex escape such as \s or \. inside a double-quoted YARA-L string, you need to double the backslash, so in your case you would write "\\s", not "\s".
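
The suggested pattern can be sanity-checked outside the rule editor against the three URLs from the question. This is only a minimal sketch in Python: it assumes Python's `re` module and the engine behind `re.capture` agree on this simple pattern, and the helper name `capture_domain` is made up for illustration. The pattern is written as a raw string, so each backslash appears once.

```python
import re
from typing import Optional

# The pattern from the reply, as a raw string: every backslash is literal,
# so "\." and "\s" reach the regex engine unchanged.
DOMAIN_RE = re.compile(r"(?:https?://)?(?:www\.)?([^/\s]+)")


def capture_domain(url: str) -> Optional[str]:
    """Return the first capture group, or None if nothing matches.

    Hypothetical helper for illustration; it searches for the first
    match and returns capture group 1.
    """
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    for url in (
        "https://www.example.com/path",
        "http://example.org",
        "subdomain.example.co.uk/path",
    ):
        print(url, "->", capture_domain(url))
```

Note that the capture group keeps everything up to the first "/" or whitespace, so a port or userinfo part (for example `example.com:8080`) would be captured as well.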

jstoner
Staff
December 27, 2024

I won't pretend that my regex chops are going to get a perfect extraction, but I would point out that when using the regex functions you would generally use backticks (`) rather than quotes around the regular expression.
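
The quotes-versus-backticks point has a close analogue in Python's ordinary and raw string literals, which may help show why the quoting style changes how many backslashes are needed. The sketch below is Python, not YARA-L, and it rests on the assumption that a backtick-delimited YARA-L string is taken literally, which is worth confirming in the YARA-L documentation.

```python
import re

# Ordinary string literal: the language consumes one level of backslashes,
# so each regex escape has to be written with a doubled backslash.
escaped = "(?:www\\.)?([^/\\s]+)"

# Raw string literal: backslashes are passed through untouched, so the
# regex is written exactly as the regex engine should see it.
raw = r"(?:www\.)?([^/\s]+)"

# Both spellings produce the same pattern text.
assert escaped == raw

if __name__ == "__main__":
    print(re.search(raw, "www.example.com/path").group(1))
```

Writing the pattern in the literal form avoids having to count backslashes, which is where the "invalid char escape" class of errors tends to come from.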

My preference would be to use something like strings.extract_hostname or strings.extract_domain instead. Examples for these functions are in this blog: https://www.googlecloudcommunity.com/gc/Community-Blog/New-to-Google-SecOps-Domain-and-Hostname-Extraction-is-NOT-Like/ba-p/819666
