Specifications

Cisco Internet Streamer CDS 2.0-2.3 Software Configuration Guide
OL-13493-04
Chapter 2 Network Design
Delivery Service
later processing. The crawler starts with a list of URLs to visit, identifies every web link in the page, and
adds every link to the list of URLs to visit. The process ends after one or more of the following conditions
are met:
• Links have been followed to a specified depth.
• Maximum number of objects has been acquired.
• Maximum content size has been acquired.
The crawler works as follows:
1. The Content Acquirer requests the starting URL that was configured for the delivery service.
2. The crawler parses the HTML at that URL for links to other files.
3. If links to other files are found, the files are requested.
4. If those files are HTML files, they are also parsed for links to additional files.
In this manner, the Content Acquirer “crawls” through the origin server.
Note: The crawler cannot parse JavaScript or VBScript to extract links, and it does not work with HTTP cookies.
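The crawl steps and stop conditions above can be sketched in Python. This is an illustrative sketch only, not the Content Acquirer implementation: the names crawl, LinkExtractor, and ORIGIN, the fetch callback, the example URLs, and the exact point at which each limit is checked are all assumptions made for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href/src attribute values found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start_url, fetch, max_depth, max_objects, max_bytes):
    """Breadth-first crawl from start_url.

    fetch(url) is a hypothetical callback returning (is_html, body_bytes),
    or None when the object cannot be retrieved.
    """
    queue = deque([(start_url, 0)])
    seen = {start_url}
    acquired = []
    total_bytes = 0
    while queue:
        if len(acquired) >= max_objects:
            break  # maximum number of objects has been acquired
        url, depth = queue.popleft()
        result = fetch(url)
        if result is None:
            continue
        is_html, body = result
        if total_bytes + len(body) > max_bytes:
            break  # maximum content size would be exceeded
        acquired.append(url)
        total_bytes += len(body)
        # Only HTML is parsed for further links, and only down to max_depth.
        if is_html and depth < max_depth:
            parser = LinkExtractor()
            parser.feed(body.decode("utf-8", errors="replace"))
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
    return acquired, total_bytes


# A made-up in-memory "origin server" used in place of real HTTP requests.
ORIGIN = {
    "http://origin.example/index.html":
        (True, b'<a href="a.html">A</a><a href="v.mpg">V</a>'),
    "http://origin.example/a.html": (True, b'<a href="b.html">B</a>'),
    "http://origin.example/v.mpg": (False, b"x" * 10),
    "http://origin.example/b.html": (True, b"<p>end</p>"),
}

if __name__ == "__main__":
    print(crawl("http://origin.example/index.html", ORIGIN.get,
                max_depth=1, max_objects=10, max_bytes=10**6))
```

In this sketch, links embedded in scripts are never seen because only tag attributes are inspected, which mirrors the limitation stated in the note above.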
When a website has directory indexing enabled and the default document feature disabled, it returns HTML
that contains a directory listing whenever a directory URL is requested. That HTML contains links to the
files in the directory, which makes it easy for the crawler to get a full listing of all the content in
that directory. The crawler searches the folders rather than parsing an HTML file; therefore, directory
indexing must be enabled and the directory must not contain an index.html, default.html, or home.html
file.
In FTP acquisition, the crawler crawls the folder hierarchy rather than parsing HTML files. Content
ingest from an SMB server for crawl jobs works the same way as FTP ingest.
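Crawling a folder hierarchy, as opposed to parsing HTML for links, can be illustrated with a short Python sketch. It walks a local directory tree as a stand-in for an FTP or SMB folder structure; the function name list_folder_hierarchy is hypothetical, and the sketch does not reflect how the CDS crawler is implemented.

```python
import os


def list_folder_hierarchy(root):
    """Return the sorted relative paths of every file under root."""
    found = []
    # os.walk descends into each subfolder, so no HTML parsing is involved.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            relative = os.path.relpath(full_path, root)
            found.append(relative.replace(os.sep, "/"))
    return sorted(found)
```

Every file is discovered from the folder structure itself, so the result does not depend on any links contained in the files.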
Content Acquirer
The Content Acquirer parses the Manifest file configured for the delivery service and generates the
metadata. If the hybrid ingest attributes are not specified, the Content Acquirer ingests the content after
generating the metadata. The Content Acquirer can be shared among many delivery services; in other
words, the same Service Engine can perform the Content Acquirer role for another delivery service.
SMB Servers
The CDS supports file acquisition from Windows file servers with shared folders and UNIX servers
running the SMB protocol. The Content Acquirer first mounts the shared folder. This mount point then
acts as the origin server from which the content is fetched. The Content Acquirer fetches the content and
stores it locally.
Note: With SMB, files larger than 2 GB cannot be ingested.
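Before scheduling an SMB acquisition, it can be useful to identify files that exceed this limit. The Python sketch below is a hypothetical pre-check, not part of the CDS software: the names SMB_MAX_BYTES and split_by_smb_limit are invented for this example, and reading "two gigabytes" as 2 x 1024^3 bytes is an assumption, because the guide does not state the exact byte count.

```python
# Assumption: "two gigabytes" is interpreted as 2 GiB (2 * 1024**3 bytes).
SMB_MAX_BYTES = 2 * 1024**3


def split_by_smb_limit(sizes, limit=SMB_MAX_BYTES):
    """Split files into those within the limit and those over it.

    sizes maps a file path to its size in bytes (for example, gathered
    with os.path.getsize on the mounted share).
    """
    ingestable = sorted(path for path, size in sizes.items() if size <= limit)
    too_large = sorted(path for path, size in sizes.items() if size > limit)
    return ingestable, too_large
```

Files listed in the second result would need to be delivered by another acquisition protocol or split at the origin.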