RegEx Quick Reference - SANS Internet Storm Center

We use Regular Expressions for parsing firewall logs. Even though the full syntax of Regular Expressions is complex, we usually only need a few regex operations to parse logs. This is oriented towards people who are writing a complete log line parser. If you are only looking for hints on how to phrase the regex for the line_filter variable in dshield.cnf, you can just skim and pick up the highlights.

`line_filter` and `line_exclude` variables

Most of the time you can just type in the string that you want to match.
line_filter=input deny
The only time you need to worry about any of the regular expression pattern stuff is if the string you need to match contains one of the characters that are regex metacharacters. (See below.) Then just put a "\" before the character that is a metacharacter. For example, if you wanted to match "kernel 2.2" then
line_filter=kernel 2\.2
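The other likely candidate is if you want to have several alternates. For instance, if you wanted to match either "input deny" or "input reject" then
line_filter=input (deny|reject)
The parentheses matter. Without them, "input deny|reject" means either "input deny" or just "reject" anywhere in the line.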
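
As a rough illustration of how such filter strings behave, the sketch below treats each one as a Perl pattern and tries it against a few log lines. The sample lines and the matches() helper are made up for this example, and how the DShield client actually applies line_filter is an assumption here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $line matches the filter string
# when that string is used as a regular expression, 0 otherwise.
sub matches {
    my ($filter, $line) = @_;
    return $line =~ /$filter/ ? 1 : 0;
}

# Made-up sample lines, for illustration only.
my @lines = (
    'kernel 2.2 input deny eth0',
    'kernel 2x2 input accept eth0',
    'kernel 2.2 output reject eth0',
);

# Plain string, escaped ".", alternation without and with grouping.
for my $filter ('input deny', 'kernel 2\.2', 'input deny|reject', 'input (deny|reject)') {
    for my $line (@lines) {
        printf "%-22s %-32s %s\n", $filter, $line,
            matches($filter, $line) ? 'match' : 'no match';
    }
}
```

Note how the ungrouped "input deny|reject" also accepts the "output reject" line, while the grouped form does not.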

Parsing log lines

The general idea is to set up a regular expression that matches the different parts of the log, with the portions that you want to assign to variables delimited by parentheses. A short Perl example follows. Assume that $line contains the "raw" log line that needs to be converted:
  if (($month,$day) = ($line =~ /^([A-Z][a-z]{2}) +(\d{1,2})/ )) {
        # Do something with $month and $day
        print "Month is $month Day is $day\n";
  } else {
        print "Regular expression match failed.  Couldn't parse $line\n";
  }
  
The portions that are delimited by parentheses are assigned to variables. In this case, we are attempting to match a *NIX syslog date, like Apr 24 and assign the month to the $month variable and the day to the $day variable.

[A-Z] matches one character that must be upper case. [a-z]{2} matches exactly two characters that must be lower case. " +" (space followed by "+") skips over one or more spaces. Then match \d{1,2} and assign to the $day variable. (Matches at least one and no more than two "digits.")

This example only extracted the month and day from the log line. To make it workable for our purposes, you would extend this concept and add additional variables and regular expression patterns to match the rest of the log.

Very short regex reference

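These "metacharacters" have special meanings in a pattern and need escaping with \ if they are to appear as literal characters in your pattern.

[](){}\.*?|-+^$@

Two of these are special only in some contexts. "-" is a metacharacter only inside a character class such as [a-z], and "@" is not a regex operator but can trigger array interpolation in a Perl pattern. Escaping either one is harmless.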

Example. If you were parsing a log line and the log put the protocol in square brackets, like "[tcp]", you'd have to phrase the regex pattern like \[tcp\] so that the square bracket characters [] are treated as literal characters and not as the beginning and end of a character class. [tcp] would match a single character that is either "t" or "c" or "p", which probably isn't what you want to do.
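
A small sketch of the difference, using a made-up log line. The quotemeta() call at the end is a standard Perl builtin that adds the backslashes for you; the variable names are just for this example.

```perl
use strict;
use warnings;

my $line = 'Apr 24 10:15:02 fw packet [tcp] dropped';  # made-up sample line

# Escaped brackets match the literal text "[tcp]"; the parentheses
# capture the protocol name between them.
my ($proto) = $line =~ /\[(\w+)\]/;
print "protocol: $proto\n" if defined $proto;

# Unescaped, [tcp] is a character class: it matches the first single
# "t", "c", or "p" in the line, which here is the "p" in "Apr".
my ($char) = $line =~ /([tcp])/;
print "character class matched: $char\n" if defined $char;

# quotemeta() adds the backslashes for you when the text to match
# literally is held in a variable.
my $literal = quotemeta('[tcp]');
print "escaped form: $literal\n";
print "literal match\n" if $line =~ /$literal/;
```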

. matches any single character except for \n (newline.)

* Modifier. Zero or more of the preceding character. Use ".*" to match a bunch of characters. "*" by itself probably won't do what you want, because "*" is a modifier.

+ Modifier. One or more of the preceding character. Same idea as "*", except that it requires at least one character to be present.

? Modifier. Zero or one of the preceding character.

\d Single digit. [0-9]. Use \d+ to match one or more digits.

\w Word character. [A-Za-z0-9_] Upper and lower case letters, digits and underscore. No punctuation.

\s White space [ \r\t\n\f] Space, carriage return, tab, new line, or form feed.

\b Word boundary anchor. Matches the position where a word starts or ends, without consuming a character: the spot between a word character (\w) and white space, punctuation, or the beginning or end of a line.

^ Anchor. Requires that the pattern be at the beginning of the line. The "^" in the /^([A-Z][a-z]{2}) (match month) example means that this pattern must be at the beginning of the line to match.

^ Negation. As the first character inside a character class, "^" negates the class: "[^\d]" would match any single character except for a digit. Um, context counts.

$ Anchor. Same thing, only for the end of the line.

Changing the case of one of the backslash shortcuts negates it. Example: \d matches a single digit, but \D matches anything except for a digit. Also applies to \w (\W) and \s (\S).
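
The entries above can be tried out with a short test loop like the following sketch. The check() helper and all of the sample strings are invented for this example.

```perl
use strict;
use warnings;

# Returns 1 if $text matches $pattern, 0 otherwise.
sub check {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

# Each entry: pattern, sample text, what the pair demonstrates.
my @checks = (
    [ 'a.c',       'abc',             '"." matches any one character' ],
    [ 'a\.c',      'abc',             'escaped "." only matches a period' ],
    [ 'ab*c',      'ac',              '"*" allows zero "b" characters' ],
    [ 'ab+c',      'ac',              '"+" needs at least one "b"' ],
    [ 'ab?c',      'abbc',            '"?" allows at most one "b"' ],
    [ '^\d+$',     '12345',           'digits from start to end' ],
    [ '^\d+$',     '123a5',           'a letter breaks the all-digit match' ],
    [ '[^\d]',     '2002',            'no non-digit character present' ],
    [ '\D',        '20-02',           '\D finds the "-"' ],
    [ '\bdeny\b',  'input deny eth0', 'whole word "deny"' ],
    [ '\bdeny\b',  'denying',         'no word boundary after "deny"' ],
);

for my $c (@checks) {
    my ($pattern, $text, $note) = @$c;
    printf "%-10s %-18s %-9s %s\n", $pattern, "'$text'",
        check($pattern, $text) ? 'match' : 'no match', $note;
}
```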

Examples

Stick these together to make a regex that can match all the parts of the log line that you are parsing. See above example for the syntax to assign these to variables. Note that these examples show parentheses, because you generally want to assign these to a variable. If you just want to match but not assign to a variable, then don't use the parentheses.

([A-Z][a-z]{2}) Matches month formatted like "Apr". First character must be upper case, second and third characters, lower case. Note that the month is still in text format and you need to do an additional translation to get it into the "MM" numerical format that the DShield format requires (2002-05-24). Look at one of the existing parsers to see how to do this.

(\d{1,2}) Matches day number. (One or two digits)

(\d+):(\d+):(\d+) Time. HH, MM, and SS separated by ":", and are assigned to separate variables. This is a bit sloppy because we aren't enforcing any character count. "1", "10", and "1234567890" all match \d+. ":" is one of the few punctuation type characters that isn't a metacharacter and you can just put it there, without having to escape it with a "\".

(\d+\.\d+\.\d+\.\d+) Matches an IP address. Note that the periods need to be escaped with "\". This is also sloppy (see above comment), because it accepts something like "999.1.1.1", but a precise IP regex pattern is real long and complicated. See The Perl Cookbook "Regex Grabbag" for an example of a precise IP matching regex pattern.

(\d{1,5}) Matches a port. (Minimum of one and maximum of five digits)

(tcp|udp) Alternation. Matches (lower case) "tcp" or "udp"

\-> Matches "->". Outside of a character class "-" is not a metacharacter, so the "\" is optional here, but escaping it does no harm.

([SAFURP12]) Matches the valid single character flags

(\([SAFURP12]\)) Same as above, but assumes that the flags are stored like "(S)" in the log. So the () parentheses characters need to be escaped with "\".

\s+ Skip over one or more characters of whitespace. Not delimited by parentheses because we aren't assigning this to a variable.

.* Skips over zero or more characters of stuff that we aren't interested in. Use .+ to require at least one character.
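
To show how the pieces stick together, here is a sketch of a parser for an invented log format. The log layout, the parse_line() name, and the field names in the returned hash are all assumptions made for this example and are not taken from any existing DShield parser. The year is passed in because a syslog date does not carry one. The range checks after the match make up for the sloppy \d+ patterns.

```perl
use strict;
use warnings;

# Month name to number, for the YYYY-MM-DD date format.
my %month_num;
@month_num{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (1 .. 12);

# The pieces from the list above, joined for a made-up log format.
# The x flag lets us space the pattern out and comment it.
my $re = qr/
    ^([A-Z][a-z]{2}) \s+ (\d{1,2})       # month and day
    \s+ (\d+):(\d+):(\d+)                # time
    \s+ .*?                              # skip host name and tag
    (\d+\.\d+\.\d+\.\d+) : (\d{1,5})     # source IP and port
    \s+ -> \s+                           # literal arrow
    (\d+\.\d+\.\d+\.\d+) : (\d{1,5})     # target IP and port
    \s+ (tcp|udp)                        # protocol
    \s+ \( ([SAFURP12]) \)               # flag in parentheses
/x;

# Hypothetical parser: returns a hash reference, or nothing on failure.
sub parse_line {
    my ($line, $year) = @_;
    my ($mon, $day, $hh, $mi, $ss, $src, $sport, $dst, $dport, $proto, $flag)
        = $line =~ $re
        or return;

    # The sloppy patterns accept out-of-range numbers, so check them here.
    return unless exists $month_num{$mon};
    my @octets = (split(/\./, $src), split(/\./, $dst));
    return if grep { $_ > 255 } @octets;
    return if $sport > 65535 || $dport > 65535;

    return {
        date   => sprintf('%04d-%02d-%02d', $year, $month_num{$mon}, $day),
        time   => sprintf('%02d:%02d:%02d', $hh, $mi, $ss),
        source => $src, sport => $sport,
        target => $dst, tport => $dport,
        proto  => $proto, flag => $flag,
    };
}

my $sample = 'Apr 24 10:15:02 fw kernel: 192.168.1.5:1025 -> 10.0.0.1:80 tcp (S)';
if (my $rec = parse_line($sample, 2002)) {
    print join("\t", @{$rec}{qw(date time source sport target tport proto flag)}), "\n";
} else {
    print "Regular expression match failed.  Couldn't parse $sample\n";
}
```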

Hints

To develop a regex that can parse a whole line, start off with a short one, like the one in the above example, and add additional variables and matching patterns one at a time. If you try to write the entire regex pattern before you test it, you will most likely go mad trying to figure out why it doesn't match. Remember "one at a time."
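
One way to follow the "one at a time" advice is to keep the pattern as a list of pieces and let a small loop report the first piece that stops the match. This is only a sketch: the first_failure() helper, the piece labels, and the sample lines are invented for this example.

```perl
use strict;
use warnings;

# Pattern pieces in the order they appear in the line, each with a label.
my @pieces = (
    [ month => '^([A-Z][a-z]{2})' ],
    [ day   => '\s+(\d{1,2})' ],
    [ time  => '\s+(\d+):(\d+):(\d+)' ],
    [ proto => '\s+(tcp|udp)' ],
);

# Grows the pattern one piece at a time and returns the label of the
# first piece that makes the match fail, or nothing if every piece matches.
sub first_failure {
    my ($line) = @_;
    my $pattern = '';
    for my $piece (@pieces) {
        my ($label, $regex) = @$piece;
        $pattern .= $regex;
        return $label unless $line =~ /$pattern/;
    }
    return;
}

# Made-up sample lines, for illustration only.
for my $line ('Apr 24 10:15:02 tcp', 'Apr 24 10:15 tcp', 'Apr 24 10:15:02 icmp') {
    my $failed = first_failure($line);
    print defined $failed
        ? "'$line' stops matching at: $failed\n"
        : "'$line' matches every piece\n";
}
```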

This document covers most of what you need to know about regular expressions to fill in the line_filter variable, or to write a parser. But there is much more that regular expressions can do. See the sections on regular expressions in Learning Perl, Programming Perl, The Perl Cookbook, or Mastering Regular Expressions (O'Reilly.) Or just hit Google and search for "regular expression tutorial"
