CONFIGURING AND ADMINISTERING COLDFUSION 9
Indexing Collections with Verity Spider
Last updated 2/21/2012
The -mimeinclude option does not let you index desired documents if the starting URL is not followed. For the MIME
variable, you can include the asterisk (*) wildcard for text strings; for example:
'text/*'
In Windows, include double-quotation marks around the argument to protect the special character (*). On UNIX, use
single-quotation marks. This is only required when you run the indexing job from a command line. Quotation marks
are not necessary within a command file (the -cmdfile option).
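The reason for the quotation marks on UNIX can be seen in a short shell sketch. It is only an illustration of shell globbing, not of Verity Spider itself, and the scratch directory named text is a hypothetical setup chosen so that the pattern has something to match.

```shell
# Work in a scratch directory that contains a subdirectory named "text"
# (hypothetical names, chosen only so the glob matches something).
workdir=$(mktemp -d)
mkdir "$workdir/text"
touch "$workdir/text/a" "$workdir/text/b"
cd "$workdir"

# Unquoted: the shell expands the pattern before any program sees it.
set -- text/*
unquoted_count=$#

# Single-quoted: the pattern is passed on as one literal argument.
set -- 'text/*'
quoted_count=$#
quoted_arg=$1

echo "unquoted: $unquoted_count argument(s); quoted: $quoted_count argument(s): $quoted_arg"
```

Without the quotation marks, the program would receive the expanded file names instead of the MIME pattern.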
You cannot use the question mark (?) wildcard with this option, and specifying the -regexp option does not enable regular expressions for it.
Example
If you want to index all Word documents at http://web.verity.com, you cannot use:
vspider -collection collname -style style_dir -start http://web.verity.com -mimeinclude 'application/msword'
This is because the starting point does not match the -mimeinclude criteria. You can use the -indmimeinclude
option to follow all documents (unless you have specified any of the exclude options) and index only those documents
that match your criteria. Replace the -mimeinclude option with the -indmimeinclude option in the preceding example.
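Applying that substitution gives a command along the following lines. This is a sketch that uses only the options and the placeholder values (collname, style_dir) from the example; the command is held in a variable and printed, not executed, so that the snippet runs on a system without Verity Spider installed.

```shell
# Revised command: -mimeinclude replaced by -indmimeinclude.
# collname and style_dir are the placeholder values from the example.
cmd="vspider -collection collname -style style_dir -start http://web.verity.com -indmimeinclude 'application/msword'"

# Print the command line instead of running it.
printf '%s\n' "$cmd"
```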
-indskip
Syntax
-indskip HTML_tag "exp"
Type
Web crawling only
Specifies that Verity Spider follows and parses for links, but does not index, any HTML document that contains the text of exp within the given HTML_tag. For multiple HTML_tag and exp combinations, use multiple instances of the -indskip option.
You can use wildcard expressions, where the asterisk (*) is for text strings and the question mark (?) is for single
characters; for example:
'/my_doc*/year199?'
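The two wildcards can be tried out with shell pattern matching, which gives the asterisk and the question mark the same meanings. This is only an approximation for illustration: Verity Spider's own matching may differ in detail, and the helper name matches and the test paths are hypothetical.

```shell
# matches STRING PATTERN: succeeds when STRING matches the wildcard PATTERN.
# PATTERN is left unquoted inside `case` so that * and ? act as wildcards.
matches() {
  case "$1" in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}

pattern='/my_doc*/year199?'

for path in /my_docs/year1995 /my_documents/year1990 /my_docs/year19955 /my_docs/year2001; do
  if matches "$path" "$pattern"; then
    echo "match:    $path"
  else
    echo "no match: $path"
  fi
done
```

The first two paths match. The third fails because the question mark stands for exactly one character, and the fourth fails on the literal text year199.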
In Windows, include double-quotation marks around the argument to protect the special characters, such as the
asterisk (*). On UNIX, use single-quotation marks. This is only required when you run the indexing job from a
command line. Quotation marks are not necessary within a command file (the -cmdfile option).
If you use backslashes, double them so that they are properly escaped; for example:
C:\\test\\docs\\path
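When paths are generated by a script, the doubling can be automated. The following is a minimal sketch that uses sed, which is one possible tool and not something the Verity documentation prescribes; the variable names are hypothetical.

```shell
# Double every backslash in a Windows path like the one in the example.
path='C:\test\docs\path'

# In the sed expression, \\ matches one backslash and \\\\ emits two.
escaped=$(printf '%s\n' "$path" | sed 's/\\/\\\\/g')

printf '%s\n' "$escaped"
```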
To use regular expressions, also specify the -regexp option.
Example 1
To skip all HTML documents that contain the word "personnel" in the Title element, while still parsing those
documents for links to other documents, use the following:
-indskip title "personnel"