HP IAP Version 2.0 User Guide (November 2008)
Proximity word sequences
You can use simple word sequences to search for words separated by separators but not by other
words. To search for document words that are in an ordered sequence, but might be separated by other
words, use a proximity word sequence.
To write a proximity word sequence, use the same syntax as a simple word sequence, but append a tilde
(~) character to the second quote, and follow that with a numeric proximity value. The proximity value
represents the maximum number of other document words that can occur between any two successive
words of the sequence. A document matches a proximity word sequence if all words occur in the
document in the same order, with at most N intervening words, where N is the proximity value.
For example, the sequence "bird garden stone"~3 matches any document that has these three
words in this order, with bird and garden separated by no more than three words, and garden and stone
separated by no more than three words. This sequence matches a document with the text “a bird in the
rose garden is near a stone” because there are at most three words between successive sequence words.
This sequence also matches “a bird garden with a stone” for the same reason.
Simple word sequences are a special case of proximity word sequences: "..." is the same as
"..."~0. Any documents found by "..."~N are also found by "..."~M, when M > N.
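The matching rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this section, not IAP's actual implementation. The function names (tokenize, matches_proximity), the definition of a word as a run of letters and digits, and the case-insensitive comparison are all assumptions made for the example.

```python
import re


def tokenize(text):
    # Assumption: a word is a run of letters or digits; everything else is a
    # separator. Lowercasing (case-insensitive matching) is also an assumption.
    return [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]


def matches_proximity(document, sequence, n):
    """Return True if the words of `sequence` occur in `document` in the same
    order, with at most `n` other words between each successive pair."""
    doc = tokenize(document)
    words = tokenize(sequence)
    if not words:
        return False
    # Positions in the document where the first sequence word occurs.
    reachable = {i for i, w in enumerate(doc) if w == words[0]}
    for word in words[1:]:
        # Keep positions of the next word that lie after some reachable
        # position of the previous word with at most n words in between.
        reachable = {
            j
            for j, w in enumerate(doc)
            if w == word and any(0 < j - p <= n + 1 for p in reachable)
        }
        if not reachable:
            return False
    return True


text_a = "a bird in the rose garden is near a stone"
text_b = "a bird garden with a stone"
print(matches_proximity(text_a, "bird garden stone", 3))  # True
print(matches_proximity(text_b, "bird garden stone", 3))  # True
print(matches_proximity(text_a, "bird garden stone", 2))  # False
```

With a proximity value of 0 the function behaves like a simple word sequence, and any text that matches for a value N also matches for every larger value, which mirrors the two properties stated above.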
Matching word sequences in attachments
This section discusses word matching in attachments. IAP renders attachment documents (such as
spreadsheets and PDF files) into text words, as it does other documents. When IAP renders a document,
it follows the document application’s internal representation of the file.
Certain file types, for example spreadsheets, look very different internally than they do externally. This
means that the word sequence in the external application representation, which the end user sees, may
differ from the internal application representation. IAP query matching uses the internal application
representation. The following examples illustrate this.
Example 1. Separators are ignored
IAP renders text into words. Remaining characters such as periods, commas, spaces, and newlines are
considered separators and are ignored. Phrase queries ignore all formatting elements and non-word
characters. The following original plain text:
“This was news to Mr. Smith.
Johnson, however, knew better.”
matches the phrase query:
“Smith Johnson”
This is because internally, the two plain text sentences are represented as one long string of continuous
words: “This was news to Mr Smith Johnson however knew better”.
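The effect described in Example 1 can be reproduced with a short Python sketch. It is an illustration only: the helper names (to_words, contains_phrase), the definition of a word as a run of letters and digits, and the case-insensitive comparison are assumptions, not a description of how IAP renders text.

```python
import re


def to_words(text):
    # Assumption: words are runs of letters or digits. Periods, commas,
    # spaces, and newlines are treated as separators and dropped.
    return re.findall(r"[A-Za-z0-9]+", text)


def contains_phrase(text, phrase):
    """Return True if the words of `phrase` appear consecutively in `text`
    once separators are ignored (a simple word sequence)."""
    doc = [w.lower() for w in to_words(text)]
    target = [w.lower() for w in to_words(phrase)]
    if not target:
        return False
    last_start = len(doc) - len(target)
    return any(doc[i:i + len(target)] == target for i in range(last_start + 1))


original = "This was news to Mr. Smith.\nJohnson, however, knew better."
print(" ".join(to_words(original)))
# This was news to Mr Smith Johnson however knew better
print(contains_phrase(original, "Smith Johnson"))  # True
```

Because the sentence boundary and the newline are discarded, Smith and Johnson become adjacent words, so the phrase matches even though the two words belong to different sentences.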
Example 2. Sequence is not intuitive
Internally, in an attachment’s original application, a large multi-page document or a single-page
spreadsheet equates to a long text sequence. Text may not appear in the same sequence internally as
it appears externally. Also, multiple instances of the same text in certain file types are represented
as a single instance.
Spreadsheets
Look at the external representation of the following example spreadsheet.