HP IAP Version 2.0 User Guide (November 2008)


Word characters and separators

Word characters include all uppercase and lowercase letters, digits, and the following additional characters:

• _ (underscore)

• # (number/pound/hash sign)

• & (ampersand)

All other characters are separators (except, in queries, the wildcards ? and * and the special query characters ~, ", -, and !).

However, && by itself is not a word. It is a Boolean operator. When combined with at least one more word character, && can be part of a word. For example, a&&b is a word.

Query analysis and document indexing are not case-sensitive. Uppercase and lowercase letters are treated the same.

Regular expression definition of English word characters

The following regular expression provides, in succinct form, a complete specification of English word characters (except for treatment of && as a non-word):

[A-Za-z0-9_#&]+
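The word rules above can be illustrated with a minimal Python sketch. It is not IAP code: the function name tokenize is hypothetical, the character class A-Za-z0-9_#& is taken from the expression above, and the sketch assumes plain document text (it does not handle the wildcards or special query characters that apply in queries).

```python
import re

# Character class from the guide's expression for English word characters.
WORD_RE = re.compile(r"[A-Za-z0-9_#&]+")


def tokenize(text):
    """Split text into lowercase words; every other character is a separator.

    Hypothetical helper for illustration only, not part of IAP.
    """
    words = []
    for match in WORD_RE.finditer(text):
        token = match.group(0)
        # A standalone && is a Boolean operator, not a word.
        if token == "&&":
            continue
        # Indexing is not case-sensitive, so fold to lowercase.
        words.append(token.lower())
    return words


print(tokenize("Tom&Jerry met a&&b && C#_1!"))
```

Here a&&b survives as a single word, the standalone && is dropped, and the trailing ! acts as a separator.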

Letters and digits in different character sets

Topics include:

• Letters and digits defined

• Letters and digits in files

Letters and digits defined

All letters and digits are word characters. What IAP considers a letter or digit depends on the character set encoding used. For US ASCII encoding, letters are uppercase and lowercase English letters (A-Z, a-z). For ISO 8859-1 (Latin-1) encoding, used for Western European languages, accented letters are included. Most ideographic characters, such as those used in Asian languages, are also considered letters.

Whatever the language and encoding used for a particular document (file or email message), IAP maps encoded characters to the Unicode 2.0 standard. The Unicode 2.0 standard is then used to determine if a given character is a letter or a digit (or neither):

• A letter is any Unicode character in one of the following Unicode categories: Ll (lowercase letter), Lu (uppercase letter), Lt (titlecase letter), Lm (modifier letter), or Lo (other letter).

• A digit is any Unicode character whose Unicode name contains the word DIGIT, provided it is not in the range \u2000 (en quad = en space) through \u2FFF (ideographic description - future).
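The two rules above can be approximated with Python's standard unicodedata module. This is a sketch under stated assumptions, not IAP's implementation: the function names is_letter and is_digit are hypothetical, and Python ships a much newer Unicode database than Unicode 2.0, so results may differ for characters added or reclassified since then.

```python
import unicodedata

# Unicode general categories that the guide counts as letters.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lm", "Lo"}


def is_letter(ch):
    """Return True if ch falls in one of the letter categories (hypothetical helper)."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_digit(ch):
    """Return True if ch's Unicode name contains the word DIGIT and
    ch lies outside the excluded range U+2000 through U+2FFF (hypothetical helper)."""
    if 0x2000 <= ord(ch) <= 0x2FFF:
        return False
    # Unnamed characters get an empty name and are therefore not digits.
    name = unicodedata.name(ch, "")
    return "DIGIT" in name.split()


for ch in ["a", "\u00e9", "\u4e2d", "_", "7", "\u0661", "\u2460"]:
    print(repr(ch), is_letter(ch), is_digit(ch))
```

For example, U+0661 (ARABIC-INDIC DIGIT ONE) counts as a digit, while U+2460 (CIRCLED DIGIT ONE) does not because it lies inside the excluded range.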

Letters and digits in files

Although all letters and digits are word characters, their treatment in files (including email message attachments) depends on the character encoding used. You can search for any words in email message bodies and headers, regardless of the encoding.

You can search for words in files (including email body, header, attachments, and indexed documents) provided the character encoding is one of the following:

Query expression syntax and matching