Brief tutorial

Regexps are built up from expressions, quantifiers, and assertions. The simplest form of expression
is a single character, for example x or 5. An expression can also be a set of characters. For example,
[ABCD] will match an A or a B or a C or a D. As a shorthand we could write this as [A-D]. If we
want to match any of the capital letters in the English alphabet we can write [A-Z]. A quantifier
tells the regexp engine how many occurrences of the expression we want, e.g. x{1,1} means
match an x which occurs at least once and at most once. We will look at assertions and more
complex expressions later.
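
The ideas above can be tried out in a few lines. The following is a minimal sketch using Python's re module rather than the regexp engine this manual describes; the two engines are assumed to agree on this basic syntax, and the variable names are illustrative only.

```python
import re

# A set and its range shorthand should accept the same single characters.
explicit_set = re.compile(r"[ABCD]")
range_set = re.compile(r"[A-D]")

for ch in "ABCDEx5":
    # fullmatch succeeds only if the whole string is matched by the pattern.
    assert bool(explicit_set.fullmatch(ch)) == bool(range_set.fullmatch(ch))

# The quantifier {1,1}: at least one and at most one occurrence of x.
one_x = re.compile(r"x{1,1}")
print(bool(one_x.fullmatch("x")))   # True
print(bool(one_x.fullmatch("xx")))  # False
```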

We will start by writing a regexp to match integers in the range 0 to 99. We will require at least
one digit so we will start with [0-9]{1,1} which means match a digit exactly once. This regexp
alone will match integers in the range 0 to 9. To match one or two digits we can increase the
maximum number of occurrences so the regexp becomes [0-9]{1,2} meaning match a digit at
least once and at most twice. However, this regexp as it stands will not match correctly, because
it will match one or two digits anywhere within a string, so it also accepts strings such as 100
or a5b. To ensure that we match against the whole string we must use the anchor assertions.
We need ^ (caret) which, when it is the first character in the regexp, means that the regexp must
match from the beginning of the string. And we also need $ (dollar) which, when it is the last
character in the regexp, means that the regexp must match until the end of the string. So now our
regexp is ^[0-9]{1,2}$. Note that assertions, such as ^ and $, do not match any characters.
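
To see why the anchors matter, the unanchored and anchored regexps can be compared on a few sample strings. This is a sketch in Python's re module, not the engine described in this manual, and the sample strings are made up for illustration. One Python-specific caveat: in Python, $ also matches just before a trailing newline, so re.fullmatch is the stricter choice there.

```python
import re

unanchored = re.compile(r"[0-9]{1,2}")
anchored = re.compile(r"^[0-9]{1,2}$")

# Hypothetical sample inputs: two valid integers and three that should fail.
samples = ["7", "42", "100", "a5b", ""]

for s in samples:
    found_inside = bool(unanchored.search(s))  # digits anywhere in the string
    whole_string = bool(anchored.search(s))    # the string is 1 or 2 digits
    print(repr(s), found_inside, whole_string)
```

Only "7" and "42" pass the anchored test, while the unanchored regexp also reports a match for "100" and "a5b".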

If you have seen regexps elsewhere, they may have looked different from the ones above.
This is because some sets of characters and some quantifiers are so common that they have
special symbols to represent them. [0-9] can be replaced with the symbol \d.
The quantifier to match exactly one occurrence, {1,1}, can be omitted, leaving the expression itself.
This means that x{1,1} is exactly the same as x alone. So our 0 to 99 matcher could be written
^\d{1,2}$. Another way of writing it would be ^\d\d{0,1}$, i.e. from the start of the string match
a digit followed by zero or one digits. In practice most people would write it ^\d\d?$. The ? is a
shorthand for the quantifier {0,1}, i.e. a minimum of no occurrences and a maximum of one
occurrence. This is used to make an expression optional. The regexp ^\d\d?$ means "from the
beginning of the string match one digit followed by zero or one digits and then the end of the
string".
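
The claim that these spellings are interchangeable can be checked mechanically. The sketch below uses Python's re module as a stand-in for the engine in this manual; the re.ASCII flag is passed because Python's \d would otherwise also accept non-ASCII digits, and the helper name verdicts is hypothetical.

```python
import re

# Four spellings of the 0 to 99 matcher from the text above.
patterns = [r"^[0-9]{1,2}$", r"^\d{1,2}$", r"^\d\d{0,1}$", r"^\d\d?$"]
compiled = [re.compile(p, re.ASCII) for p in patterns]

def verdicts(text):
    """Return one True/False result per pattern for the given text."""
    return [bool(rx.search(text)) for rx in compiled]

# Every integer from 0 to 999 plus a few non-numeric strings.
candidates = [str(n) for n in range(1000)] + ["", "x", "1x", " 5"]

# All four patterns should agree on every candidate.
assert all(len(set(verdicts(c))) == 1 for c in candidates)
print(verdicts("99"), verdicts("100"))
```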

Our second example is matching the words 'mail', 'letter' or 'correspondence' but without
matching 'email', 'mailman', 'mailer', 'letterbox' etc. We will start by just matching 'mail'. In
full the regexp is m{1,1}a{1,1}i{1,1}l{1,1}, but since each expression itself is automatically quantified
by {1,1} we can simply write this as mail: an m followed by an a followed by an i followed by an
l. The symbol | (bar) is used for alternation, so our regexp now becomes mail|letter|correspondence
which means match 'mail' or 'letter' or 'correspondence'. Whilst this regexp will find the words
we want, it will also find them inside words we do not want, such as 'email'.
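
The unwanted matches are easy to reproduce. Here is a small sketch with Python's re module, assumed to treat alternation the same way as the engine in this manual; the sample sentence is invented for illustration.

```python
import re

words = re.compile(r"mail|letter|correspondence")

# Hypothetical sample text containing one wanted word and two unwanted ones.
text = "Send the email or drop a letter in the letterbox."

# findall returns every non-overlapping match, including those inside
# longer words such as 'email' and 'letterbox'.
hits = words.findall(text)
print(hits)  # ['mail', 'letter', 'letter']
```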

We will start by putting our regexp in parentheses, (mail|letter|correspondence).
Parentheses have two effects: firstly they group expressions together, and secondly they identify
parts of the regexp that we wish to capture for reuse later on. Our regexp still matches any of
the three words but now they are grouped together as a unit. This is useful for building up more
complex regexps. It is also useful because it allows us to examine which of the words actually
matched.
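
As an illustration of examining which word matched, the sketch below reads back the captured text. It uses Python's re module, where captured groups are retrieved with group(n); the engine described in this manual exposes captures through its own interface, and the helper name which_word is hypothetical.

```python
import re

words = re.compile(r"(mail|letter|correspondence)")

def which_word(text):
    """Return the first of the three words found in text, or None."""
    m = words.search(text)
    # group(1) holds whatever the first pair of parentheses captured.
    return m.group(1) if m else None

print(which_word("Your correspondence has arrived"))  # correspondence
print(which_word("Nothing relevant here"))            # None
```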

We need to use another assertion, this time \b "word boundary": \b(mail|letter|correspondence)\b.
This regexp means "match a word boundary followed by the expression in parentheses followed
by another word boundary". The \b assertion matches at a position in the regexp not a character

Enfocus Switch 10