User Guide
154 Chapter 8 Verity Spider
By default, each indexing thread uses as much memory as is available from the
system.
-maxnumdoc
Syntax: -maxnumdoc num_docs
Specifies the maximum number of documents to be downloaded or submitted for
indexing. The value for num_docs does not necessarily correspond exactly to the
number of documents indexed. The following factors affect the actual number:
Whether the value of num_docs falls within a block of documents dictated by -submitsize. If it does, the entire block of documents must be processed.
Whether any retrieved documents are not actually indexed because they are invalid or corrupt.
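The block effect described above can be illustrated with a little arithmetic. The following Python sketch is not part of Verity Spider. It assumes that documents are processed in whole blocks of -submitsize, so the count is rounded up to the next block boundary. The function name and the sample values are hypothetical, and the real total can be lower when invalid or corrupt documents are skipped.

```python
def estimated_docs_processed(num_docs: int, submit_size: int) -> int:
    """Round num_docs up to a whole number of submit_size blocks.

    Assumption: a block that has been started is always processed in full.
    """
    if num_docs <= 0 or submit_size <= 0:
        raise ValueError("num_docs and submit_size must be positive")
    # Ceiling division using integers only.
    blocks = -(-num_docs // submit_size)
    return blocks * submit_size


# Hypothetical values: a limit of 250 documents with blocks of 128.
print(estimated_docs_processed(250, 128))
```

With these sample values, the limit of 250 falls inside the second block, so the estimate is two full blocks.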
-mimemap
Syntax: -mimemap path_and_filename
Specifies a control file (simple ASCII text) that maps file extensions to MIME-types.
This allows you to make custom associations and override defaults.
The format for the control file is:
#file_ext_no_dot mime-type
abc application/word
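To check a control file before passing it to -mimemap, a small validation script can help. The Python sketch below is not part of Verity Spider. It assumes that lines beginning with # are comments and that every other non-blank line holds exactly two whitespace-separated fields, as in the format shown above. The function name parse_mimemap is hypothetical.

```python
def parse_mimemap(text: str) -> dict[str, str]:
    """Parse control-file text into an {extension: mime_type} mapping."""
    mapping = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 'ext mime-type', got {raw!r}")
        ext, mime = parts
        if ext.startswith("."):
            raise ValueError(f"line {line_no}: extension must not include the dot")
        mapping[ext] = mime
    return mapping


sample = """#file_ext_no_dot mime-type
abc application/word
"""
print(parse_mimemap(sample))
```

The sample text reproduces the two lines from the format above, so the result maps the extension abc to application/word.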
-nocache
Type: Web crawling only.
Used with -noindex or -nosubmit, this option disables the caching of files during
Web site indexing. This reduces the demands on your disk space.
Normally, Verity Spider downloads URLs, writes them to a bulk insert file, and
downloads the documents themselves. When indexing occurs, once -submitsize has
been reached, the cached files are indexed and then deleted. If you use -noindex,
the bulk insert file is submitted but not processed by Verity Spider, so the
documents are not deleted until whichever process performs the indexing takes
over. This is usually mkvdk or collsvc, or you can subsequently run Verity Spider
again with the -processbif option.
By using -nocache in conjunction with -noindex or -nosubmit, you avoid storing
files locally at all. Files are downloaded only when indexing actually occurs.
See also -noindex.
-nodupdetect
Type: Web crawling only.
Disables checksum-based detection of duplicates when indexing Web sites.
URL-based duplicate detection is still performed.