User Guide

Chapter 8 Verity Spider
Overview
The Verity Spider enables you to index Web-based and file system documents
throughout the enterprise. Verity Spider works in conjunction with the Verity
KeyView document filtering technology so that more than two hundred of the most
popular application document formats can be indexed, including Office 2000 and
WordPerfect, ASCII text, HTML, SGML, XML, and PDF (Adobe Acrobat) documents.
Supports Web standards
Verity Spider supports key Web standards used by Internet and intranet sites today.
Standard HREF links and frame pointers are recognized, so navigation through
them is supported. Redirected pages are followed so that the real underlying
document is indexed. Verity Spider adheres to the robots exclusion standard
specified in robots.txt files, so administrators can maintain friendly visits to
remote Web sites. The HTTP Basic Authentication mechanism is supported, so
password-protected sites can be indexed.
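The guide does not show the mechanism itself, so here is a minimal sketch of how an HTTP Basic Authentication header is built for a protected page fetch. This is a general illustration of the standard (base64 of "user:password"), not Verity-specific code; the function name is hypothetical.

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header value sent for HTTP Basic
    Authentication: "Basic " + base64("username:password")."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

# A crawler fetching a protected page would send, for example:
#   Authorization: Basic dXNlcjpwYXNz
print(basic_auth_header("user", "pass"))
```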
Unlike other Web crawlers, Verity Spider does not need to maintain complete local
copies of remote documents. When documents are viewed through Verity
Information Server, they are read from their native location, with optional
highlighting.
Restart capability
When an indexing job fails, or Verity Spider is otherwise unable to index a
significant number or type of URLs, you can restart the indexing job to update
the collection. Only those URLs that were not successfully indexed previously are
processed.
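The restart behavior can be sketched as follows. The store (a URL-to-status mapping) and the `index_one` callback are hypothetical stand-ins for illustration, not Verity APIs: on a restart, URLs already marked as indexed are skipped, and only failures are retried.

```python
def restart_index(urls, store, index_one):
    """Run (or re-run) an indexing job, skipping URLs the
    persistent store already marks as successfully indexed."""
    for url in urls:
        if store.get(url) == "indexed":
            continue  # already done in a previous run
        try:
            index_one(url)
            store[url] = "indexed"
        except Exception:
            store[url] = "failed"  # candidate for the next restart
    return store
```

A second call with the same store touches only the URLs that failed the first time, which is the restart capability described above.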
State maintenance through a persistent store
Verity Spider V3.7 stores the state of gathered and indexed URLs in a persistent store,
allowing it to track progress for the purposes of gracefully and efficiently restarting
halted indexing jobs.
Previous versions of Verity Spider held state information only in memory, so any
stoppage of spidering resulted in lost work. It also meant that larger target sites
required significantly more memory for spidering. The persistent store can also be
used for reporting, such as the number of indexed pages, visited pages, rejected
pages, and broken links.
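Counts like these can be derived directly from per-URL status records. The sketch below assumes a simple URL-to-status mapping with illustrative labels; it is not Verity's actual store schema.

```python
from collections import Counter

def spider_report(store):
    """Summarize a persistent store mapping URL -> status into
    the counts mentioned above (indexed, visited, rejected, broken)."""
    counts = Counter(store.values())
    return {
        "indexed": counts["indexed"],
        "visited": len(store),        # every recorded URL was visited
        "rejected": counts["rejected"],
        "broken_links": counts["broken"],
    }
```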
Performance
Thanks to low memory requirements, flow control, multithreading, and efficient
Domain Name System (DNS) lookups, spidering performance is greatly improved
over previous versions.
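One such optimization, resolving each hostname only once and caching the result, can be sketched like this. The resolver function is passed in (a hypothetical stand-in) so the example needs no network access; real spiders would wrap an actual DNS lookup the same way.

```python
from functools import lru_cache

def make_cached_resolver(resolve):
    """Wrap a resolver (hostname -> IP address) so repeated lookups
    for the same host hit an in-memory cache instead of DNS."""
    @lru_cache(maxsize=None)
    def cached(hostname):
        return resolve(hostname)
    return cached
```

Since a crawl of one site issues many requests to the same few hosts, caching turns per-URL DNS round trips into a single lookup per host.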