User Guide

102 Chapter 9: Indexing Collections with Verity Spider
Web standard support
Verity Spider supports the key web standards used by Internet and intranet sites. Standard HREF
links and frame pointers are recognized, so navigation through them is supported. Redirected
pages are followed so that the real underlying document is indexed. Verity Spider adheres to the
robots exclusion standard specified in robots.txt files, so administrators can ensure friendly
visits to remote websites. The HTTP Basic Authentication mechanism is supported, so
password-protected sites can be indexed.
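The robots exclusion check can be sketched with Python's standard urllib.robotparser module. The site and rules below are hypothetical, and this is only an illustration of the standard's behavior, not Verity Spider's own implementation:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content from a target site; these rules
# disallow the /private/ tree for all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler checks each URL before fetching it.
print(parser.can_fetch("*", "http://example.com/private/report.html"))  # False
print(parser.can_fetch("*", "http://example.com/public/index.html"))    # True
```

A crawler that honors this check never requests disallowed URLs, which is what keeps its visits friendly to remote sites.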
Unlike other web crawlers, Verity Spider does not need to maintain complete local copies of
remote documents. When documents are viewed through Verity Information Server, they are
read from their native location, with optional highlighting.
Restart capability
When an indexing job fails, or Verity Spider otherwise cannot index a significant number or
type of URLs, you can restart the indexing job to update the collection. Only those URLs that
were not successfully indexed previously are processed.
State maintenance through a persistent store
Verity Spider V3.7 stores the state of gathered and indexed URLs in a persistent store, which lets
it track progress so that halted indexing jobs can be restarted gracefully and efficiently.
Previous versions of Verity Spider held state information only in memory, so any stoppage of
spidering resulted in lost work, and larger target sites required significantly more memory for
spidering. The persistent store also supports reporting, such as the number of indexed pages,
visited pages, rejected pages, and broken links.
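The idea behind the persistent store can be sketched as follows. The schema, state names, and URLs here are assumptions for illustration; Verity's actual on-disk format is not documented in this guide:

```python
import sqlite3

# Minimal sketch of a persistent URL-state store. Each URL is recorded
# with its crawl outcome so a restarted job can skip what already
# succeeded. A real spider would use a file on disk, not ":memory:".
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE urls (url TEXT PRIMARY KEY, state TEXT)")
db.executemany("INSERT INTO urls VALUES (?, ?)", [
    ("http://example.com/a.html", "indexed"),
    ("http://example.com/b.html", "failed"),
    ("http://example.com/c.html", "broken_link"),
    ("http://example.com/d.html", "indexed"),
])

# On restart, only URLs that were not successfully indexed are re-queued.
pending = [u for (u,) in db.execute(
    "SELECT url FROM urls WHERE state != 'indexed'")]

# The same store supports summary reporting.
stats = dict(db.execute("SELECT state, COUNT(*) FROM urls GROUP BY state"))
print(pending)  # the two URLs to retry
print(stats)    # counts per state, e.g. indexed, failed, broken_link
```

Because the state lives outside process memory, a stopped job loses no work and memory use no longer grows with the size of the target site.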
Performance
Spidering performance is greatly improved over previous versions, thanks to lower memory
requirements, flow control, multithreading, and efficient Domain Name System (DNS) lookups.
Flow control
When indexing websites, Verity Spider distributes requests to web servers in a round-robin
manner, fetching one URL from each web server in turn. With flow control, a faster website
can finish before a slower one, so Verity Spider optimizes indexing across all web servers.
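Round-robin fetching can be sketched as cycling over per-server URL queues. The hostnames and paths below are hypothetical:

```python
from collections import deque

# One URL queue per web server (hypothetical sites).
queues = {
    "fast.example.com": deque(["/1", "/2"]),
    "slow.example.com": deque(["/a", "/b", "/c"]),
}

order = []
# Fetch one URL from each server in turn; a server whose queue empties
# drops out, so a fast site can finish before a slow one.
while any(queues.values()):
    for host, q in queues.items():
        if q:
            order.append(host + q.popleft())

print(order)
```

The resulting fetch order interleaves the two servers until the shorter queue is exhausted, after which only the slower site's URLs remain.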
Verity Spider V3.7 adjusts the number of connections per server depending on the download
bandwidth. When the download bandwidth from a web server falls below a certain value, Verity
Spider automatically scales back the number of connections to that web server. There will always
be at least one connection to a web server. When the download bandwidth increases to an
acceptable level, Verity Spider reallocates connections, up to the value of the -connections
option (4 by default). You can turn off flow control with the -noflowctrl option.