User Guide

102 Chapter 9: Indexing Collections with Verity Spider
Web standard support
Verity Spider supports the key web standards used by Internet and intranet sites. Standard HREF
links and frame pointers are recognized, so navigation through them is supported. Redirected
pages are followed so that the real underlying document is indexed. Verity Spider adheres to the
robots exclusion standard specified in robots.txt files, so administrators can ensure friendly
visits to remote websites. The HTTP Basic Authentication mechanism is supported, so
password-protected sites can be indexed.
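The robots exclusion check can be sketched with Python's standard urllib.robotparser module. The site and rules below are hypothetical, and this is only an illustration of the standard's behavior, not Verity Spider's own implementation:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content from a target site; these rules
# disallow the /private/ tree for all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler checks each URL before fetching it.
print(parser.can_fetch("*", "http://example.com/private/report.html"))  # False
print(parser.can_fetch("*", "http://example.com/public/index.html"))    # True
```

A crawler that honors this check never requests disallowed URLs, which is what keeps its visits friendly to remote sites.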
Unlike other web crawlers, Verity Spider does not need to maintain complete local copies of
remote documents. When documents are viewed through Verity Information Server, they are
read from their native location, with optional highlighting.
Restart capability
When an indexing job fails, or Verity Spider otherwise cannot index a significant number or
type of URLs, you can restart the indexing job to update the collection. Only those URLs that
were not successfully indexed previously are processed.
State maintenance through a persistent store
Verity Spider V3.7 stores the state of gathered and indexed URLs in a persistent store, which lets
it track progress so that halted indexing jobs can be restarted gracefully and efficiently.
Previous versions of Verity Spider held state information only in memory, so any stoppage of
spidering resulted in lost work, and larger target sites required significantly more memory for
spidering. The persistent store also supports reporting, such as the number of indexed pages,
visited pages, rejected pages, and broken links.
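The idea behind the persistent store can be sketched as follows. The schema, state names, and URLs here are assumptions for illustration; Verity's actual on-disk format is not documented in this guide:

```python
import sqlite3

# Minimal sketch of a persistent URL-state store. Each URL is recorded
# with its crawl outcome so a restarted job can skip what already
# succeeded. A real spider would use a file on disk, not ":memory:".
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE urls (url TEXT PRIMARY KEY, state TEXT)")
db.executemany("INSERT INTO urls VALUES (?, ?)", [
    ("http://example.com/a.html", "indexed"),
    ("http://example.com/b.html", "failed"),
    ("http://example.com/c.html", "broken_link"),
    ("http://example.com/d.html", "indexed"),
])

# On restart, only URLs that were not successfully indexed are re-queued.
pending = [u for (u,) in db.execute(
    "SELECT url FROM urls WHERE state != 'indexed'")]

# The same store supports summary reporting.
stats = dict(db.execute("SELECT state, COUNT(*) FROM urls GROUP BY state"))
print(pending)  # the two URLs to retry
print(stats)    # counts per state, e.g. indexed, failed, broken_link
```

Because the state lives outside process memory, a stopped job loses no work and memory use no longer grows with the size of the target site.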
Performance
Spidering performance is greatly improved over previous versions, thanks to lower memory
requirements, flow control, multithreading, and efficient Domain Name System (DNS) lookups.
Flow control
When indexing websites, Verity Spider distributes requests to web servers in a round-robin
manner, fetching one URL from each web server in turn. With flow control, a faster website
can finish before a slower one, so Verity Spider optimizes indexing across all web servers.
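Round-robin fetching can be sketched as cycling over per-server URL queues. The hostnames and paths below are hypothetical:

```python
from collections import deque

# One URL queue per web server (hypothetical sites).
queues = {
    "fast.example.com": deque(["/1", "/2"]),
    "slow.example.com": deque(["/a", "/b", "/c"]),
}

order = []
# Fetch one URL from each server in turn; a server whose queue empties
# drops out, so a fast site can finish before a slow one.
while any(queues.values()):
    for host, q in queues.items():
        if q:
            order.append(host + q.popleft())

print(order)
```

The resulting fetch order interleaves the two servers until the shorter queue is exhausted, after which only the slower site's URLs remain.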
Verity Spider V3.7 adjusts the number of connections per server depending on the download
bandwidth. When the download bandwidth from a web server falls below a certain value, Verity
Spider automatically scales back the number of connections to that web server. There will always
be at least one connection to a web server. When the download bandwidth increases to an
acceptable level, Verity Spider reallocates connections, up to the value of the -connections
option (4 by default). You can turn off flow control with the -noflowctrl option.