Specifications
Brief tutorial
Regexps are built up from expressions, quantifiers, and assertions. The simplest form of expression
is simply a character, for example, x or 5. An expression can also be a set of characters. For example,
[ABCD] will match an A or a B or a C or a D. As a shorthand we could write this as [A-D]. If we
want to match any of the capital letters in the English alphabet we can write [A-Z]. A quantifier
tells the regexp engine how many occurrences of the expression we want, e.g. x{1,1} means
match an x which occurs at least once and at most once. We will look at assertions and more
complex expressions later.
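As a minimal sketch of these building blocks, the following uses Python's re module. Python is an assumption made purely for illustration; the regexp engine described in this tutorial may differ in its details. The sample strings are hypothetical.

```python
import re

# A set of characters: [ABCD] matches a single A, B, C or D.
set_match = re.search(r"[ABCD]", "xyzCxyz")

# The range [A-D] is shorthand for the same set.
range_match = re.search(r"[A-D]", "xyzCxyz")

# [A-Z] matches any capital letter of the English alphabet.
capitals = re.findall(r"[A-Z]", "Regular Expressions In Use")

# x{1,1} asks for exactly one x, so it behaves like a plain x.
one_x = re.search(r"x{1,1}", "box")

print(set_match.group(), range_match.group(), capitals, one_x.group())
```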
We will start by writing a regexp to match integers in the range 0 to 99. We will require at least
one digit so we will start with [0-9]{1,1} which means match a digit exactly once. This regexp
alone will match integers in the range 0 to 9. To match one or two digits we can increase the
maximum number of occurrences so the regexp becomes [0-9]{1,2} meaning match a digit at
least once and at most twice. However, this regexp as it stands will not match correctly: it
will match one or two digits anywhere within a string, so it also finds a match inside a
longer string such as 123. To ensure that we match against the whole
string we must use the anchor assertions. We need ^ (caret) which when it is the first character
in the regexp means that the regexp must match from the beginning of the string. And we also
need $ (dollar) which when it is the last character in the regexp means that the regexp must
match until the end of the string. So now our regexp is ^[0-9]{1,2}$. Note that assertions, such
as ^ and $, do not match any characters.
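The effect of the anchors can be sketched as follows, again assuming Python's re module for illustration only. The names unanchored, anchored and results, and the sample strings, are hypothetical.

```python
import re

unanchored = re.compile(r"[0-9]{1,2}")
anchored = re.compile(r"^[0-9]{1,2}$")

# Without anchors the regexp finds one or two digits inside a longer string.
found_inside = unanchored.search("abc123") is not None

# With anchors the whole string must consist of one or two digits.
samples = ["0", "7", "42", "99", "100", "abc12", "4x"]
results = {text: anchored.search(text) is not None for text in samples}

print(found_inside)
print(results)
```

Only the strings made up of exactly one or two digits pass the anchored regexp. Note that in Python, $ also matches just before a newline at the very end of the string; re.fullmatch avoids that if it matters.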
If you have seen regexps elsewhere, they may have looked different from the ones above.
This is because some sets of characters and some quantifiers are so common that they have
special symbols to represent them. [0-9] can be replaced with the symbol \d.
The quantifier to match exactly one occurrence, {1,1}, can be replaced with the expression itself.
This means that x{1,1} is exactly the same as x alone. So our 0 to 99 matcher could be written
^\d{1,2}$. Another way of writing it would be ^\d\d{0,1}$, i.e. from the start of the string match
a digit followed by zero or one digits. In practice most people would write it ^\d\d?$. The ? is a
shorthand for the quantifier {0,1}, i.e. a minimum of no occurrences and a maximum of one
occurrence. This is used to make an expression optional. The regexp ^\d\d?$ means "from the
beginning of the string match one digit followed by zero or one digits and then the end of the
string".
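To check that the four spellings of the 0 to 99 matcher behave alike, here is a small sketch assuming Python's re module. The re.ASCII flag is used because in Python \d would otherwise also match non-ASCII digits, which [0-9] does not; other engines may treat \d differently. The sample strings are hypothetical.

```python
import re

patterns = [r"^[0-9]{1,2}$", r"^\d{1,2}$", r"^\d\d{0,1}$", r"^\d\d?$"]
samples = ["", "5", "42", "123", "a1"]

# For each sample string, record the verdict of every pattern in turn.
verdicts = {
    s: [re.search(p, s, re.ASCII) is not None for p in patterns]
    for s in samples
}

for s, row in verdicts.items():
    print(repr(s), row)
```

Every row contains four identical values, which is what we expect if the shorthand forms are true equivalents of the long form.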
Our second example is matching the words 'mail', 'letter' or 'correspondence' but without
matching 'email', 'mailman', 'mailer', 'letterbox' etc. We will start by just matching 'mail'. In
full the regexp is m{1,1}a{1,1}i{1,1}l{1,1}, but since each expression itself is automatically quantified
by {1,1} we can simply write this as mail; an m followed by an a followed by an i followed by an
l. The symbol | (bar) is used for alternation, so our regexp now becomes mail|letter|correspondence
which means match 'mail' or 'letter' or 'correspondence'. Whilst this regexp will find the words
we want, it will also find words we do not want, such as 'email'.
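The over-matching described above can be seen in a short sketch, assuming Python's re module for illustration. The word list is hypothetical.

```python
import re

words = re.compile(r"mail|letter|correspondence")

candidates = ["mail", "letter", "correspondence", "email", "letterbox", "parcel"]

# search() succeeds if any alternative occurs anywhere in the text.
hits = {text: words.search(text) is not None for text in candidates}

print(hits)
```

Both 'email' and 'letterbox' are reported as hits because they contain one of the alternatives, even though they are not the words we are looking for.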
We will start by putting our regexp in parentheses, (mail|letter|correspondence).
Parentheses have two effects: firstly, they group expressions together, and secondly, they identify
parts of the regexp that we wish to capture for reuse later on. Our regexp still matches any of
the three words but now they are grouped together as a unit. This is useful for building up more
complex regexps. It is also useful because it allows us to examine which of the words actually
matched.
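Both effects of parentheses can be sketched briefly, assuming Python's re module for illustration. The sample sentences and the "room" suffix are hypothetical and are not part of the tutorial's example.

```python
import re

# Capturing: group(1) returns the text matched inside the first parentheses.
grouped = re.compile(r"(mail|letter|correspondence)")
match = grouped.search("Please forward my letter today")
captured = match.group(1)

# Grouping: with parentheses, " room" follows either alternative.
with_group = re.search(r"(mail|letter) room", "mail room").group()

# Without parentheses the alternatives are "mail" and "letter room".
without_group = re.search(r"mail|letter room", "mail room").group()

print(captured, "|", with_group, "|", without_group)
```

The last two lines show why grouping matters when a regexp is extended: alternation otherwise applies to everything on either side of the bar.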
We need to use another assertion, this time \b "word boundary": \b(mail|letter|correspondence)\b.
This regexp means "match a word boundary followed by the expression in parentheses followed
by another word boundary". The \b assertion matches at a position in the regexp not a character
Enfocus Switch 10