HP StorageWorks Reference Information Storage System V1.0 User Guide (May 2004)

Chapter 5: Query Syntax and Matching
Query Expression Syntax and Matching
So, why does the search engine consider define more similar to definite than
to pine, even though the edit distances are the same (three)? Because the edit
distance (the number of character changes) is compared to the word length
(that of the shorter of the query word and the document word). For purposes of
querying, two words are closer if it takes fewer changes to turn one into the
other, relative to their lengths.
The similarity ratio used by the search engine is thus d/min(query, doc),
where d is the edit distance, min is a function that returns the lesser of its
arguments, and query and doc are the lengths of the query word and
document word, respectively. A fuzzy word matches a document word if this
ratio is no more than 0.5.
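
The ratio test described above can be sketched in Python. This is only an
illustration, not the search engine's own code: the function names
similarity_ratio and fuzzy_matches are hypothetical, and the edit distance is
passed in as an argument rather than computed, because how the engine counts
character changes is defined elsewhere in this guide.

```python
def similarity_ratio(edit_distance: int, query_word: str, doc_word: str) -> float:
    """Return d / min(query, doc), where query and doc are word lengths.

    edit_distance (d) is supplied by the caller; this sketch does not
    assume any particular way of counting character changes.
    """
    shorter = min(len(query_word), len(doc_word))
    if shorter == 0:
        raise ValueError("both words must be non-empty")
    return edit_distance / shorter


def fuzzy_matches(edit_distance: int, query_word: str, doc_word: str,
                  threshold: float = 0.5) -> bool:
    """A fuzzy word matches if the ratio is no more than the threshold (0.5)."""
    return similarity_ratio(edit_distance, query_word, doc_word) <= threshold
```

With the edit distance of three stated in this section, comparing define
against definite gives a ratio of 3/6, which is at the threshold and so
matches, while comparing define against pine gives 3/4, which does not.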
Examples:

Words Compared      Similarity Ratio            Match?
define, definite    3/min(6, 8) = 3/6 = 0.5     yes
define, pine        3/min(6, 4) = 3/4 = 0.75    no (0.75 > 0.5)

Matching Word Sequences
You can use word sequences to find documents with words in a specified
order that are separated by no more than a specified maximum distance.

Simple Word Sequences
To search for an ordered sequence of words, use a simple word sequence: a list
of literal query words (no wildcards) separated by spaces (or other separators)
and enclosed in double quotes ("). A document matches a simple word
sequence if all the words occur in the document in the same order, with no
intervening words.
For example, the sequence "like rolling stone" matches a document with the text
“like a rolling stone” (the stop word “a” is not indexed), but it does not match a
document with the text “like a large rolling stone” because of the intervening
word “large.”
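
The simple word sequence rule described above can be sketched in Python as
follows. This is a minimal illustration under stated assumptions, not the
search engine's implementation: the names STOP_WORDS, index_words, and
matches_simple_sequence are hypothetical, the stop-word list contains only
“a” (the one stop word this section confirms), and the tokenization and
lowercasing are simplifications chosen for the example.

```python
import re

# Assumption: only "a" is confirmed as a stop word by this section.
STOP_WORDS = frozenset({"a"})


def index_words(text, stop_words=STOP_WORDS):
    """Split text into lowercase words and drop stop words (not indexed)."""
    # Assumption: words are runs of letters and digits; case is ignored.
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [w for w in words if w not in stop_words]


def matches_simple_sequence(query, document_text):
    """True if the quoted query words occur in order with no intervening
    indexed words in the document."""
    sequence = index_words(query.strip('"'))
    doc_words = index_words(document_text)
    if not sequence:
        return False
    n = len(sequence)
    # Look for the sequence as a contiguous run of indexed document words.
    return any(doc_words[i:i + n] == sequence
               for i in range(len(doc_words) - n + 1))
```

Under these assumptions, the query "like rolling stone" matches the text
“like a rolling stone” because “a” is dropped before comparison, and it does
not match “like a large rolling stone” because “large” remains between
“like” and “rolling.”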