System information

CONFIGURING AND ADMINISTERING COLDFUSION 9
Indexing Collections with Verity Spider
Last updated 2/21/2012
Flow control
When indexing websites, Verity Spider distributes requests to web servers in a round-robin manner: it fetches one URL
from each web server in turn. With flow control, a faster website can finish indexing before a slower one instead of
being held back by it. In this way, Verity Spider optimizes indexing on every web server.
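The round-robin distribution described above can be pictured with a short Python sketch. This is an illustration only, not Verity Spider's actual implementation; the function name round_robin_order and the example URLs are hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def round_robin_order(urls):
    """Yield URLs one host at a time, cycling through the hosts in turn."""
    queues = {}  # host -> deque of pending URLs, in first-seen host order
    for url in urls:
        queues.setdefault(urlparse(url).netloc, deque()).append(url)

    hosts = deque(queues)
    while hosts:
        host = hosts.popleft()
        yield queues[host].popleft()
        if queues[host]:
            hosts.append(host)  # host still has work, so it rejoins the rotation


# Hypothetical example: one large site and one small site.
urls = [
    "http://a.example/1", "http://a.example/2", "http://a.example/3",
    "http://b.example/1",
]
order = list(round_robin_order(urls))
```

In this sketch the small site drops out of the rotation as soon as its queue is empty, so it finishes without waiting for the larger one.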
Verity Spider adjusts the number of connections per server depending on the download bandwidth. When the
download bandwidth from a web server falls below a certain value, Verity Spider automatically scales back the number
of connections to that web server. There is always at least one connection to a web server. When the download
bandwidth increases to an acceptable level, Verity Spider reallocates connections (per the value of the -connections
option, which is 4 by default). You can turn off flow control with the -noflowctrl option.
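The scaling rule can be sketched as a small Python function. The guide does not state the bandwidth threshold or how many connections are added or removed at a time, so the threshold parameter and the step of one connection are assumptions made for illustration. Only the floor of one connection and the default ceiling of 4 come from the text.

```python
def adjust_connections(current, bandwidth, threshold, max_connections=4):
    """Return the next connection count for a single web server.

    Assumptions (not from the guide): `threshold` is a hypothetical bandwidth
    cutoff, and the count changes by one connection per adjustment.
    """
    if bandwidth < threshold:
        # Slow server: scale back, but always keep at least one connection.
        return max(1, current - 1)
    # Acceptable bandwidth: move back toward the -connections value.
    return min(max_connections, current + 1)


# Hypothetical readings for one server, in arbitrary bandwidth units.
slow = adjust_connections(current=4, bandwidth=10, threshold=50)
floor = adjust_connections(current=1, bandwidth=10, threshold=50)
recovering = adjust_connections(current=3, bandwidth=100, threshold=50)
```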
Multi-threading
Verity Spider separates the gathering and indexing jobs into multiple threads for concurrency. Additionally, Verity
Spider can open concurrent connections to web servers for fetching documents, and run concurrent indexing
threads for maximum utilization. This translates to an overall improvement in throughput.
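As a rough picture of separate gathering and indexing threads, the following Python sketch connects fetcher threads to indexer threads through a queue. It is a generic producer-consumer pattern, not Verity Spider's code; run_pipeline, fake_fetch, fake_index, and the thread counts are all hypothetical.

```python
import queue
import threading


def run_pipeline(urls, fetch, index, fetchers=4, indexers=2):
    """Fetch documents and index them concurrently on separate threads."""
    pending = queue.Queue()   # URLs waiting to be fetched
    fetched = queue.Queue()   # (url, document) pairs waiting to be indexed
    for url in urls:
        pending.put(url)

    def fetch_worker():
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to fetch
            fetched.put((url, fetch(url)))

    def index_worker():
        while True:
            item = fetched.get()
            if item is None:  # sentinel: no more documents will arrive
                return
            index(*item)

    fetch_threads = [threading.Thread(target=fetch_worker) for _ in range(fetchers)]
    index_threads = [threading.Thread(target=index_worker) for _ in range(indexers)]
    for t in fetch_threads + index_threads:
        t.start()
    for t in fetch_threads:
        t.join()
    for _ in index_threads:
        fetched.put(None)  # one sentinel per indexing thread
    for t in index_threads:
        t.join()


# Stand-ins for real network and index operations.
indexed = {}
lock = threading.Lock()


def fake_fetch(url):
    return f"body of {url}"


def fake_index(url, body):
    with lock:
        indexed[url] = body


run_pipeline(["http://a.example/1", "http://b.example/1"], fake_fetch, fake_index)
```

Because indexing threads start consuming as soon as the first document arrives, fetching and indexing overlap rather than running one after the other.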
Efficient DNS lookups
Verity Spider minimizes DNS lookups, which greatly improves throughput. If an indexing job is limited by domain or
host, then no DNS lookups are made on hosts that fall outside that range. In earlier versions, DNS lookups were made
on all candidate URLs.
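The idea of skipping lookups for out-of-range hosts can be sketched in Python as follows. The function resolve_in_scope, the allowed_hosts set, and the stand-in resolver are hypothetical. The per-host cache is an added assumption about how lookups might be minimized, not something the guide describes.

```python
from urllib.parse import urlparse


def resolve_in_scope(urls, allowed_hosts, resolve):
    """Resolve only hosts inside the allowed range, once per host."""
    cache = {}
    for url in urls:
        host = urlparse(url).hostname
        if host not in allowed_hosts:
            continue  # out of range: skipped without any DNS lookup
        if host not in cache:
            cache[host] = resolve(host)  # assumed caching: one lookup per host
    return cache


# A stand-in resolver that records each lookup instead of touching the network.
lookups = []


def fake_resolve(host):
    lookups.append(host)
    return "192.0.2.1"  # address from the documentation-only range


candidates = [
    "http://www.example.com/a",
    "http://www.example.com/b",
    "http://other.example.net/c",
]
addresses = resolve_in_scope(candidates, {"www.example.com"}, fake_resolve)
```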
Proxy handling efficiency
To allow for greater flexibility when dealing with indexing jobs that involve proxy servers and firewalls, use the
following options:
-noproxy: Reduces proxy checking for certain hosts.
-proxyauth: Authenticates on proxy servers.
About Verity Spider syntax
Before you create an indexing task for a new collection, make copies of the relevant default style files to ensure that you
have a set of template style files in a known, stable state.
Running multiple simultaneous Verity Spider jobs can cause performance problems for searches. This does not mean
that you should never run indexing jobs when users might be searching, because your collections are available for
searching even while indexing jobs are running. To optimize performance, try staggering your indexing jobs to avoid
overloading your server.
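One simple way to stagger jobs is to give each a start time offset from the previous one. The Python sketch below only computes such a schedule. The job names, start time, and two-hour gap are hypothetical, and launching the jobs is left to whatever scheduler you use.

```python
from datetime import datetime, timedelta


def staggered_starts(job_names, first_start, gap):
    """Map each job name to a start time, spaced `gap` apart."""
    return {name: first_start + i * gap for i, name in enumerate(job_names)}


# Hypothetical collections indexed overnight, two hours apart.
starts = staggered_starts(
    ["intranet", "docs", "support"],
    first_start=datetime(2012, 2, 21, 1, 0),
    gap=timedelta(hours=2),
)
```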
The Verity Spider command
The vspider executable file, which starts the Verity Spider utility, is located in the platform/bin directory, as follows:
Server and multiserver configuration: The vspider.exe (Windows) or vspider (UNIX) file is located in
cf_root/verity/k2/platform/bin (server configuration) or jrun_root/verity/k2/platform/bin (multiserver configuration),
where platform is _nti40 for Windows, _solaris for Solaris, or _ilnx21 for Linux.
J2EE configuration: The vspider.exe (Windows) or vspider (UNIX) file is located in verity_root/k2/platform/bin, where
platform is _nti40 for Windows, _solaris for Solaris, or _ilnx21 for Linux.
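The location rules above can be summarized in a small Python helper. The directory names come from the text, but the function vspider_path and the example root directories are hypothetical, so substitute your own cf_root, jrun_root, or verity_root.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Platform directory names as listed in this section.
PLATFORM_DIRS = {"windows": "_nti40", "solaris": "_solaris", "linux": "_ilnx21"}


def vspider_path(root, os_name, j2ee=False):
    """Build the expected path to the vspider executable.

    `root` is cf_root or jrun_root for the server and multiserver
    configurations, or verity_root when `j2ee` is True.
    """
    is_windows = os_name == "windows"
    base = (PureWindowsPath if is_windows else PurePosixPath)(root)
    if not j2ee:
        base = base / "verity"  # server and multiserver layouts add this level
    exe = "vspider.exe" if is_windows else "vspider"
    return base / "k2" / PLATFORM_DIRS[os_name] / "bin" / exe


# Hypothetical installation roots.
linux_server = str(vspider_path("/opt/coldfusion9", "linux"))
windows_server = str(vspider_path("C:\\ColdFusion9", "windows"))
solaris_j2ee = str(vspider_path("/opt/verity", "solaris", j2ee=True))
```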
At its most basic level, a Verity Spider command consists of the following: