In my effort to increase my knowledge of the tools that make up the Linux command line, I took the time to review how to use regular expressions by reading a short guide written in 1991. Below are my notes.
***
Regular expressions (REs) are used to search for lines of text that match a specific pattern. REs match text on a per-line basis: they do not match patterns that start on one line and end on another.
Characters used for regular expressions can be placed into one of three categories: anchors, character sets, and modifiers.
Anchors are used to specify the position of the pattern in relation to a line of text. Character Sets match one or more characters in a single position. Modifiers specify how many times the previous character set is repeated.
I. Anchors
Two anchor characters are "^" and "$". They match the beginning and the end of a line, respectively.
The characters "\<" and "\>" are anchors as well; see the modifier section below for details.
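To convince myself how "^" and "$" behave, here is a small sketch using grep, which uses basic regular expressions by default. The three sample lines and the variable names are my own, made up for illustration.

```shell
# Three sample lines (hypothetical test data).
lines='hello world
say hello
hello'

# -c prints the number of matching lines instead of the lines themselves.
starts=$(printf '%s\n' "$lines" | grep -c '^hello')   # lines that begin with "hello"
ends=$(printf '%s\n' "$lines" | grep -c 'hello$')     # lines that end with "hello"
whole=$(printf '%s\n' "$lines" | grep -c '^hello$')   # lines that are exactly "hello"

echo "begin=$starts end=$ends exact=$whole"   # prints: begin=2 end=2 exact=1
```

Using both anchors together is what turns "contains the pattern" into "the whole line is the pattern".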
II. Character Sets
A simple character set is "hello"; this will match "hello" anywhere in the text.
The "." wildcard matches any single character.
Square brackets "[]" match a single character if it equals any one of the characters inside the brackets.
^[012345689]$ will match a line consisting of a single digit that is not 7.
Square brackets can be used with a shorthand range notation. This shorthand notation can include multiple ranges.
^[0-9]$ is equivalent to ^[0123456789]$
[A-Za-z0-9_] will match a single digit, letter, or underscore.
Multiple character sets can be combined by placing them adjacent to each other. Placing a "^" as the first character inside the square brackets matches any character *except* those listed in the brackets.
[aeiou][^aeiou] matches a vowel followed by a non-vowel.
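A quick way to check the bracket examples above is to run them through grep. This is only a sketch: the sample words and variable names are made up, and setting LC_ALL=C is my assumption for keeping ranges like A-Z limited to plain ASCII.

```shell
export LC_ALL=C   # assumption: the C locale keeps ranges such as A-Z plain ASCII

# Hypothetical test data, one item per line.
words='7
5
ab
ae
x-1'

# A line that is a single digit other than 7.
single_digit=$(printf '%s\n' "$words" | grep '^[012345689]$')

# A vowel followed by a non-vowel, anywhere in the line.
vowel_pair=$(printf '%s\n' "$words" | grep '[aeiou][^aeiou]')

# Count of lines made up only of letters, digits, and underscores.
word_only=$(printf '%s\n' "$words" | grep -c '^[A-Za-z0-9_][A-Za-z0-9_]*$')

echo "$single_digit $vowel_pair $word_only"   # prints: 5 ab 4
```

Note that "ae" is not matched by [aeiou][^aeiou]: the negated set still has to match a character, and the end of the line is not a character.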
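To include a "-" in a character set, one can place the "-" directly after the opening square bracket (or directly before the closing one). To include a "]", one can place it directly after the opening square bracket; in POSIX regular expressions a backslash inside square brackets is an ordinary character, not an escape. A "[" needs no special treatment inside square brackets.
[-0-9] will match a "-" or a digit.
[]-] will match a "]" or a "-".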
III. Modifiers
The asterisk "*" matches zero or more copies of the previous character set. To match one or more of the previous character set, one can repeat the character set before placing the asterisk.
[0-9][0-9][0-9]* will match a number with 2 or more digits.
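The "zero or more" part is easy to get wrong, so here is a small grep sketch of it. The sample lines and variable names are made up for illustration.

```shell
# Hypothetical test data.
lines='a1
b22
c333
no digits'

# Two digits, then zero or more further digits: lines with a 2+ digit number.
two_plus=$(printf '%s\n' "$lines" | grep -c '[0-9][0-9][0-9]*')

# The asterisk alone allows zero copies, so every line matches, even "no digits".
any=$(printf '%s\n' "$lines" | grep -c '[0-9]*')

echo "two_plus=$two_plus any=$any"   # prints: two_plus=2 any=4
```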
To match a character set repeated a specific number of times, or within a range of times, one can use "\{" and "\}". Note that this is a case where a backslash, normally used to escape special characters, instead gives a character a special function within a regular expression. This happens when the backslash is placed before a "<", ">", "{", "}", "(", ")", or a digit.
[A-Z]\{3,5\} will match 3 to 5 uppercase letters.
According to the guide, the possible values for x and y in \{x,y\} are 0-255. If the y value is omitted, as in \{x,\}, the character set can repeat x or more times. If the comma is also omitted, as in \{x\}, the character set must repeat exactly x times.
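Here is a sketch of the three forms of the repetition count, tried with grep. The sample lines and variable names are my own. I anchored most of the patterns with "^" and "$" because, without anchors, a longer run of letters still contains a shorter run that matches.

```shell
# Hypothetical test data: runs of 2, 3, 5, and 6 uppercase letters.
lines='AB
ABC
ABCDE
ABCDEF'

in_range=$(printf '%s\n' "$lines" | grep -c '^[A-Z]\{3,5\}$')   # 3 to 5 letters
at_least=$(printf '%s\n' "$lines" | grep -c '^[A-Z]\{3,\}$')    # 3 or more letters
exactly=$(printf '%s\n' "$lines" | grep -c '^[A-Z]\{3\}$')      # exactly 3 letters

# Without anchors, ABCDEF also matches because it contains a run of 3 to 5 letters.
unanchored=$(printf '%s\n' "$lines" | grep -c '[A-Z]\{3,5\}')

echo "$in_range $at_least $exactly $unanchored"   # prints: 2 3 1 3
```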
To search for words, character sets are placed between "\<" and "\>".
"\<[tT]he\>" will match the word "the" or "The" (but not a word containing "the", such as "other").
Backreferences, i.e. matching a previously found pattern, are made using "\(" and "\)" to remember a pattern and "\1" to recall it.
"\([a-z]\)\1" will match two adjacent identical lower case letters.
Extended Regular Expressions
egrep and awk use extended regular expressions (EREs). According to the guide, EREs do not have the characters whose special meaning is activated via a backslash: "\{", "\}", "\<", "\>", "\(", "\)". In today's POSIX EREs, grouping and repetition counts are available, but they are written without the backslash: "(", ")", "{", "}".
EREs have two modifiers not found in basic regular expressions:
"?" matches 0 or 1 instances of the preceding character set.
"+" matches 1 or more instances of the preceding character set.
Extended regular expressions give a special meaning to "(", "|", and ")". These characters provide "or" functionality, allowing one to match a choice of patterns.
"(From|Subject)" will match either "From" or "Subject".
===
After reviewing the guide I was left wondering: for basic regular expressions, how does one match a string that appears between x and y times in a row? For example:
Say I want to match a line that contains 2 to 4 adjacent "ha"s and nothing else, i.e. one of:
haha
hahaha
hahahaha
I discovered that one can combine the grouping characters "\(" and "\)" with the repetition characters "\{" and "\}".
A regex that matches the above three lines, and only the above three lines, is:
"^\(ha\)\{2,4\}$"