Specifications

Cisco Internet Streamer CDS 2.0-2.3 Software Configuration Guide
OL-13493-04
Appendix B Creating Manifest Files
Working with Manifest Files
Note: If you specify both the max-number and maxTotalSizeIn attributes as the criteria to use to stop a crawl
job, the condition that is met first takes precedence. The crawl job stops either when the maximum
number of objects is acquired or when the maximum content size is reached, whichever occurs first. For
example, if the crawl job has acquired the maximum number of objects specified in the Manifest file but
has not yet reached the maximum content size, the crawl job stops.
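The whichever-comes-first rule in this note can be sketched in Python. This is only an illustration of the stopping logic, not the CDS acquirer's implementation. The function name should_stop, its parameters, and the byte size assumed for one megabyte are hypothetical.

```python
from typing import Optional

MB = 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes; the guide does not say


def should_stop(objects_acquired: int, bytes_acquired: int,
                max_number: Optional[int] = None,
                max_total_size_mb: Optional[int] = None) -> bool:
    """Return True as soon as either configured limit is reached."""
    # Limit on the number of acquired objects (max-number).
    if max_number is not None and objects_acquired >= max_number:
        return True
    # Limit on total content size (maxTotalSizeIn-MB).
    if max_total_size_mb is not None and bytes_acquired >= max_total_size_mb * MB:
        return True
    # Neither limit is configured, or neither has been reached yet.
    return False
```

With both limits set, the job stops when the object count is reached even though the size limit has not been, which matches the case the note describes.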
The following is an example of a website crawl job:
<server name="cisco">
  <host name="http://www.cisco.com/jobs/" />
</server>
<crawler
  server="cisco"
  start-url="eng/index.html"
  depth="10"
  prefix="eng/"
  reject="\.pl"
  maxTotalSizeIn-MB="200"
/>
This website crawl job example specifies the following:
The start-url path resolves to http://www.cisco.com/jobs/eng/index.html.
The crawl follows website links to a depth of 10.
Only URLs with the prefix http://www.cisco.com/jobs/eng/ are searched.
URLs containing .pl (Perl script pages) are rejected.
The crawl stops once 200 MB of total content has been acquired.
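The way these attributes narrow the set of crawled URLs can be sketched in Python. This is a rough illustration under stated assumptions, not the crawler's actual algorithm: the helper is_candidate is hypothetical, depth is assumed to start at 0 for the start URL, and the reject pattern is assumed to be searched anywhere in the full URL.

```python
import re
from urllib.parse import urljoin

HOST = "http://www.cisco.com/jobs/"          # <host name="..."> from the example
START_URL = urljoin(HOST, "eng/index.html")  # start-url resolved against the host
PREFIX = urljoin(HOST, "eng/")               # prefix resolved against the host
REJECT = re.compile(r"\.pl")                 # reject pattern from the example
MAX_DEPTH = 10                               # depth attribute from the example


def is_candidate(url: str, depth: int) -> bool:
    """Hypothetical filter: within depth, under the prefix, and not rejected."""
    if depth > MAX_DEPTH:
        return False
    if not url.startswith(PREFIX):
        return False
    # Assumption: the reject pattern may match anywhere in the URL.
    return REJECT.search(url) is None
```

For instance, is_candidate(START_URL, 0) holds, while a URL such as http://www.cisco.com/jobs/eng/apply.pl is rejected by the pattern and http://www.cisco.com/jobs/sales/index.html fails the prefix check.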
If the server attribute is omitted, the server named in the last <server> tag specified above it is used.
If there are no <server> tags close by in the Manifest file, the server that hosts the Manifest file is used,
which means that the relative URL is relative to the Manifest file URL.
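The fallback order for choosing the base URL can be sketched as follows. The function resolve_src and the Manifest file URL used in the usage note are hypothetical, and standard relative-URL joining is assumed; the guide does not describe how the software performs the resolution internally.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_src(src: str, manifest_url: str,
                explicit_host: Optional[str] = None,
                last_server_host: Optional[str] = None) -> str:
    """Pick the base URL in the fallback order described above, then join."""
    # 1. Host of the explicitly named server, if given.
    # 2. Otherwise the host of the last <server> tag above the entry.
    # 3. Otherwise the URL of the Manifest file itself.
    base = explicit_host or last_server_host or manifest_url
    return urljoin(base, src)
```

Under these assumptions, resolve_src("index.html", "http://origin.example.com/manifests/site.xml") yields http://origin.example.com/manifests/index.html, because the relative URL is taken relative to the Manifest file URL.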
Writing Common Regular Expressions
A regular expression is a formula for matching strings that follow a recognizable pattern. The following
special characters have special meanings in regular expressions:
. * \ ? [ ] ^ $
If the regular expression string does not include any of these special characters, then only an exact match
satisfies the search. For example, “stock” matches only the exact substring “stock.”
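The difference between a literal pattern and one that uses special characters can be checked with a short Python sketch. Python's re module is used here only as a stand-in; its syntax may differ in details from the regular expression dialect the Manifest file parser accepts, and the helper is_literal is hypothetical.

```python
import re

SPECIALS = set(".*\\?[]^$")  # the special characters listed above


def is_literal(pattern: str) -> bool:
    """True when the pattern contains none of the listed special characters."""
    return not any(ch in SPECIALS for ch in pattern)


# A literal pattern matches only where that exact substring occurs.
assert is_literal("stock")
assert re.search("stock", "livestock report") is not None
assert re.search("stock", "stack report") is None

# An unescaped dot matches any character; an escaped dot matches only ".".
assert not is_literal(r"\.pl")
assert re.search(r".pl", "apply") is not None   # "ppl" satisfies ".pl"
assert re.search(r"\.pl", "apply") is None
assert re.search(r"\.pl", "search.pl") is not None
```

This is why a reject pattern aimed at Perl script pages is written as \.pl rather than .pl.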
Scheduling Content Acquisition
Two attributes, ttl and prefetch, are used to schedule content acquisition. Use ttl to specify the frequency
of checking the content for freshness, in minutes. For example, to check for page freshness every day,
enter ttl="1440".
In the following example, page freshness is scheduled to be checked once a day:
<item
  src="index.html"
  ttl="1440"
/>
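Because ttl is expressed in minutes, longer intervals have to be converted by hand. The following Python sketch shows the arithmetic; the helper names ttl_minutes and item_tag are hypothetical and are not part of any CDS tool.

```python
def ttl_minutes(days: int = 0, hours: int = 0, minutes: int = 0) -> int:
    """Convert a freshness-check interval to the minutes value used by ttl."""
    return (days * 24 + hours) * 60 + minutes


def item_tag(src: str, ttl: int) -> str:
    """Render an <item> tag like the one in the example above."""
    return f'<item src="{src}" ttl="{ttl}" />'
```

For a daily check, ttl_minutes(days=1) gives 1440 (24 hours of 60 minutes each), which is the value used in the example.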