HP StorageWorks Reference Information Storage System V1.0 User Guide (May 2004)
Chapter 5: Query Syntax and Matching
Query Expression Syntax and Matching
The following regular expression provides, in succinct form, a complete
specification of English word characters (except for the treatment of && as a
non-word):
[A-Za-z0-9_#&]+
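As an illustration only, the expression above can be tried out in Python. This is a minimal sketch, not the RISS implementation: it assumes the spaces inside the brackets are typographic rather than part of the character class, it does not model the special treatment of && as a non-word, and the names WORD_RE and english_words are hypothetical.

```python
import re

# Character class taken from the expression above; the spaces inside the
# brackets are assumed to be typographic and are left out (assumption).
WORD_RE = re.compile(r"[A-Za-z0-9_#&]+")

def english_words(text):
    """Return the maximal runs of English word characters in text.

    Hypothetical helper for illustration; it does not handle the
    special case of && being treated as a non-word.
    """
    return WORD_RE.findall(text)

print(english_words("C# R&D foo_bar 42!"))
```

Under these assumptions, characters such as # and & stay attached to the surrounding word, while spaces and other punctuation act as separators.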
See Also
• Stop Words, on page 5-7
• Matching Words, on page 5-7
• Boolean Query Expressions, on page 5-10
Letters and Digits in Different Character Sets
Letters and Digits Defined
All letters and digits are word characters. Just what the RISS software
considers a letter or a digit depends on the character set encoding used. For
the US ASCII encoding, the letters are uppercase and lowercase English
letters (A–Z, a–z). For the ISO 8859–1 (Latin–1) encoding, used for Western
European languages, accented letters are included. Most ideographic
characters, such as those used in Asian languages, are also considered letters.
Whatever the language and encoding used for a particular document (file or
email message), the RISS software maps encoded characters to the
Unicode 2.0 standard. The Unicode 2.0 standard is then used to determine if
a given character is a letter or a digit (or neither):
• A letter is any Unicode character in one of these Unicode categories:
  Ll (lowercase letter), Lu (uppercase letter), Lt (titlecase letter),
  Lm (modifier letter), or Lo (other letter).
• A digit is any Unicode character whose Unicode name contains the word
  DIGIT, provided it is not in the range \u2000 (en quad = en space) through
  \u2FFF (ideographic description – future).
  This includes the digits of the following character sets: ISO 8859–1
  (Latin–1), Arabic-Indic, Extended Arabic-Indic, Devanagari, Bengali,
  Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai,
  Lao, Tibetan, and Fullwidth.
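The two definitions above can be sketched with Python's standard unicodedata module. This is an approximation under stated assumptions, not the RISS implementation: Python consults the Unicode database bundled with the interpreter rather than Unicode 2.0, so results may differ for characters added or reclassified since then. The function names is_letter and is_digit are hypothetical.

```python
import unicodedata

# General categories that count as letters, per the definition above.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lm", "Lo"}

def is_letter(ch):
    """True if ch falls in one of the five Unicode letter categories."""
    return unicodedata.category(ch) in LETTER_CATEGORIES

def is_digit(ch):
    """True if the Unicode name of ch contains the word DIGIT and ch
    lies outside the excluded range U+2000 through U+2FFF."""
    name = unicodedata.name(ch, "")  # empty string for unnamed characters
    in_excluded_range = 0x2000 <= ord(ch) <= 0x2FFF
    return "DIGIT" in name.split() and not in_excluded_range

# A few sample characters: ASCII, Latin-1, ideographic, Arabic-Indic,
# fullwidth, and a circled digit from the excluded range.
for ch in ["a", "\u00e9", "\u4e2d", "7", "\u0663", "\uff11", "\u2460"]:
    print(f"U+{ord(ch):04X}", is_letter(ch), is_digit(ch))
```

In this sketch, CIRCLED DIGIT ONE (U+2460) is rejected even though its name contains DIGIT, because it lies inside the excluded range, and SUPERSCRIPT TWO (U+00B2) is rejected because its name does not contain the word DIGIT.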