User Guide


126 Chapter 9: Indexing Collections with Verity Spider

-norobo

Type: Web crawling only

Ignores any robots.txt files encountered. Many websites use a robots.txt file to specify which parts of the site indexers should avoid. The default is to honor any robots.txt files.

If you are re-indexing a site and the robots.txt file has changed, Verity Spider deletes documents that have been newly disallowed by the robots.txt file.

Use this option with discretion and extreme care, especially in conjunction with the -cgiok option.

See also -nodocrobo.
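
To make the effect of honoring or ignoring robots.txt concrete, here is a minimal Python sketch. It is not how Verity Spider is implemented; it only uses the standard library's robots.txt parser to show which URLs would be filtered out. The robots.txt content, the URLs, the agent name ExampleSpider, and the honor_robots parameter are all hypothetical and chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical candidate URLs on the example host.
URLS = [
    "http://www.spider.com/private/a.html",
    "http://www.spider.com/public/b.html",
]


def allowed_urls(urls, robots_txt, honor_robots=True, agent="ExampleSpider"):
    """Return the URLs a crawler would keep under the given robots.txt.

    honor_robots=False mimics the idea behind -norobo: nothing is filtered.
    """
    if not honor_robots:
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(agent, u)]


if __name__ == "__main__":
    print(allowed_urls(URLS, ROBOTS_TXT))                      # default: honor robots.txt
    print(allowed_urls(URLS, ROBOTS_TXT, honor_robots=False))  # ignore robots.txt
```

With the default behavior only the /public/ URL survives. With robots.txt ignored, both URLs are kept, which is why the option calls for care.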

-pathlen

Syntax: -pathlen num_pathsegments

Limits indexing to the specified number of path segments in the URL or file system path. The path length is determined as follows:

• The host name and drive letter are not included; for example, neither www.spider.com:80/ nor C:\ would be included in determining the path length.

• All elements following the host name are included.

• The actual filename, if present, is included; for example, /world.html would be included in determining the path length.

• Any directory paths between the host and the actual filename are included.

Example

For the following URL, the path length would be four:

http://www.spider:80/comics/fun/funny/world.html
                     <-1--> <2> <-3-> <---4---->

For the following file system path, the path length would be three:

C:\files\docs\datasheets
   <-1-> <2-> <---3---->

The default value is 100 path segments.
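
The counting rules above can be sketched in a few lines of Python. This is an illustrative approximation, not Verity Spider's own code. The function names path_length and within_pathlen are hypothetical, and treating the limit as inclusive (a path with exactly num_pathsegments segments is still indexed) is an assumption.

```python
import re
from urllib.parse import urlsplit


def path_length(location):
    """Count path segments, excluding the host name (and port) or drive letter."""
    if "://" in location:
        # URL: keep only the path component, dropping scheme, host, and port.
        path = urlsplit(location).path
    else:
        # File system path: drop a leading drive letter such as "C:".
        path = re.sub(r"^[A-Za-z]:", "", location)
    # Split on either separator and ignore empty pieces from leading slashes.
    return len([segment for segment in re.split(r"[\\/]", path) if segment])


def within_pathlen(location, num_pathsegments=100):
    """Assumed inclusive check against the limit (default taken from the text)."""
    return path_length(location) <= num_pathsegments


if __name__ == "__main__":
    print(path_length("http://www.spider:80/comics/fun/funny/world.html"))
    print(path_length(r"C:\files\docs\datasheets"))
```

Run against the two examples above, the sketch yields four and three segments respectively.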

-refreshtime

Syntax: -refreshtime timeunits

Specifies that documents indexed within the period given by timeunits are not refreshed.

The following is the syntax for timeunits:

n day n hour n min n sec

Where n is a positive integer. You must include spaces. Because only the first three letters of each time unit are parsed, you can use the singular or plural form of the word.
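
As a rough illustration of the timeunits syntax, the following Python sketch parses such a string into a duration. It is not Verity Spider's parser. The function name parse_timeunits is hypothetical, and two behaviors are assumptions: any subset of the four units is accepted, and unit names are matched case-sensitively.

```python
import re
from datetime import timedelta

# Only the first three letters of each unit are significant, per the text above.
_UNIT_SECONDS = {"day": 86400, "hou": 3600, "min": 60, "sec": 1}


def parse_timeunits(text):
    """Parse a string such as '1 day 6 hours' into a timedelta."""
    tokens = text.split()
    if not tokens or len(tokens) % 2 != 0:
        raise ValueError("expected space-separated 'n unit' pairs")
    total_seconds = 0
    for number, unit in zip(tokens[0::2], tokens[1::2]):
        # n must be a positive integer.
        if not re.fullmatch(r"[0-9]+", number) or int(number) <= 0:
            raise ValueError(f"not a positive integer: {number!r}")
        key = unit[:3]
        if key not in _UNIT_SECONDS:
            raise ValueError(f"unknown time unit: {unit!r}")
        total_seconds += int(number) * _UNIT_SECONDS[key]
    return timedelta(seconds=total_seconds)


if __name__ == "__main__":
    print(parse_timeunits("1 day 6 hours"))
    print(parse_timeunits("30 minutes 15 sec"))
```

Because only the first three letters are checked, "hour" and "hours" map to the same unit. A string without spaces, such as "1day", is rejected, which mirrors the requirement to include spaces.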