
Paths and URLs Options
-nodocrobo
Specifies that ROBOT META tag directives are to be ignored.
In HTML 3.0 and earlier, robot directives could be given only in the robots.txt file
under the root directory of a Web site. In HTML 4.0, every document can have robot
directives embedded in a META tag. Use this option to ignore them. This option
should, of course, be used with discretion.
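For example, a document containing a robot directive such as the following (a
hypothetical illustration) would normally be excluded from indexing; with
-nodocrobo, the directive is ignored and the document can be indexed:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">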
See also -norobo and http://www.w3c.org/TR/REC-html40/html40.txt.
-nofollow
Syntax: -nofollow "exp"
Type: Web crawling only.
Specifies that the Verity Spider cannot follow any URLs which match the expression exp. If
you do not specify an exp value for -nofollow, the Verity Spider assumes a value of "*",
in which case no documents are followed.
You can use wildcard expressions, where the asterisk ( * ) is for text strings and the
question mark ( ? ) is for single characters. You should always encapsulate the exp
values in double quotes to ensure they are properly interpreted.
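For example, the following value (a hypothetical pattern shown for illustration)
prevents the Verity Spider from following any URL that contains the string "print":
-nofollow "*print*"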
If you use backslashes, you must double them so they are properly escaped. For
example:
C:\\test\\docs\\path
To use regular expressions, also specify the -regexp option.
Previous versions of the Verity Spider did not allow the use of an expression. This
meant that for each starting point URL, only the first document would be indexed.
With the addition of the expression functionality, you can now selectively skip URLs
even within documents.
See also -regexp.
-norobo
Type: Web crawling only.
Specifies that any robots.txt files encountered are ignored. The robots.txt file is
used on many Web sites to specify what parts of the site indexers should avoid. The
default is to honor any robots.txt files.
If you are re-indexing a site and robots.txt has changed, the Verity Spider will
delete documents that have been newly disallowed by robots.txt.
This option should, of course, be used with discretion and extreme care, especially in
conjunction with -cgiok.
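For example, a robots.txt file containing the following entries (a hypothetical
illustration) would normally keep the Verity Spider out of the /cgi-bin/ directory;
with -norobo, those URLs are followed anyway:
User-agent: *
Disallow: /cgi-bin/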
See also -nodocrobo and
http://info.webcrawler.com/mak/projects/robots/norobots.html.