HP IAP Version 2.0 User Guide (November 2008)

Matching similar words
Topics include:
Fuzzy words, page 40
Measuring word similarity, page 40
Fuzzy words
You can search for document words that are textually similar to a given literal query word (that is, one
containing no wildcards). To do this, append a tilde (~) character to the word, creating a fuzzy word.
For example, the fuzzy word define~ matches the similar words defined and definite, but does not
match defining, definition, indefinite, or pine. It also matches define itself.
Measuring word similarity
The edit distance (also called Levenshtein distance) between two words is the number of single-character
operations (deletion, replacement, or insertion) required to change one word into the other word.
For example, the edit distance between define and pine is three: two deletions (d and e) and one
replacement (f by p). The distance between define and definite is also three (e replaced by i; te inserted).
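The definition above can be sketched with the standard dynamic-programming computation of Levenshtein distance. This is a minimal illustration, assuming case-sensitive comparison of plain strings; the function name edit_distance is hypothetical and is not part of the product. How the search engine itself counts operations is not specified beyond the description above. The textbook algorithm always returns the minimum number of operations, so it can find a shorter path than a hand-worked one: for define and definite it returns 2 (insert i and t before the final e).

```python
def edit_distance(a: str, b: str) -> int:
    """Return the Levenshtein distance between words a and b."""
    # prev[j] is the distance between the prefix of a processed so far
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # replacement (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("define", "pine"))     # 3
print(edit_distance("define", "defined"))  # 1
```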
The search engine considers define more similar to definite than to pine, even though the edit distances
are the same (three), because the edit distance (the number of character changes) is compared to the
length of the shorter of the query and document words. For querying purposes, two words are closer if
fewer changes are needed to turn one word into the other relative to their lengths.
The similarity ratio used by the search engine is d/min(query, doc), where d is the edit distance, min is a
function that returns the lesser of its arguments, and query and doc are the lengths of the query word and
document word, respectively. A fuzzy word matches a document word if this ratio is no more than 0.5.
Examples:
Words Compared      Similarity Ratio            Match?
define, definite    3/min(6,8) = 3/6 = 0.5      yes
define, pine        3/min(6,4) = 3/4 = 0.75     no (0.75 > 0.5)
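The ratio test in the examples above can be sketched as a small function. This is an illustration only: the names similarity_ratio, fuzzy_matches, and MATCH_THRESHOLD are hypothetical, the edit distance is passed in as a number rather than computed, and both words are assumed to be non-empty.

```python
MATCH_THRESHOLD = 0.5  # a fuzzy word matches if the ratio is no more than this


def similarity_ratio(distance: int, query_word: str, doc_word: str) -> float:
    """Return d / min(query, doc), where query and doc are word lengths."""
    # Assumption: both words are non-empty, so the divisor is never zero.
    return distance / min(len(query_word), len(doc_word))


def fuzzy_matches(distance: int, query_word: str, doc_word: str) -> bool:
    """Return True if the ratio is within the matching threshold."""
    return similarity_ratio(distance, query_word, doc_word) <= MATCH_THRESHOLD


# The two rows of the examples table, using the edit distances stated there.
print(similarity_ratio(3, "define", "definite"), fuzzy_matches(3, "define", "definite"))
print(similarity_ratio(3, "define", "pine"), fuzzy_matches(3, "define", "pine"))
```

With the distances given in the table, the first call yields a ratio of 0.5 (a match) and the second a ratio of 0.75 (no match).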
Matching word sequences
You can use word sequences to find documents with words that occur in a specified order and are
separated by a specified maximum distance.
Topics include:
Simple word sequences, page 40
Proximity word sequences, page 41
Matching word sequences in attachments, page 41
Simple word sequences
To search for an ordered sequence of words, use a simple word sequence, which is a list of literal
query words (no wildcards) separated by spaces (or other separators) and enclosed in quotes ("). A
document matches a simple word sequence if all words occur in the document in the same order, with
no intervening words.
For example, the sequence "like a rolling stone" does not match a document with the text like a
large rolling stone because of the intervening word large.
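The matching rule for simple word sequences can be sketched as a search for a contiguous run of words. This is a rough illustration, not the search engine's implementation: the function names are hypothetical, and the tokenization (words as runs of letters, digits, or apostrophes) and the case-insensitive comparison are assumptions that this guide does not confirm.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words."""
    # Assumption: anything other than letters, digits, or apostrophes
    # is treated as a separator, and case is ignored.
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def matches_simple_sequence(sequence: str, document: str) -> bool:
    """Return True if the sequence words occur consecutively, in order."""
    words = tokenize(sequence)
    doc_words = tokenize(document)
    n = len(words)
    if n == 0:
        return False
    # Slide a window of n words across the document and compare.
    return any(doc_words[i:i + n] == words
               for i in range(len(doc_words) - n + 1))


print(matches_simple_sequence("like a rolling stone",
                              "like a large rolling stone"))  # False
print(matches_simple_sequence("like a rolling stone",
                              "How does it feel, like a rolling stone?"))  # True
```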