Cisco Internet Streamer CDS 2.0-2.3 Software Configuration Guide

OL-13493-04

Chapter 2 Network Design

Delivery Service

later processing. The crawler starts with a list of URLs to visit, identifies every web link in the page, and adds every link to the list of URLs to visit. The process ends after one or more of the following conditions are met:

• Links have been followed to a specified depth.

• Maximum number of objects has been acquired.

• Maximum content size has been acquired.

The crawler works as follows:

1. The Content Acquirer requests the starting URL that was configured for the delivery service.

2. The crawler parses the HTML at that URL for links to other files.

3. If links to other files are found, the files are requested.

4. If those files are HTML files, they are also parsed for links to additional files.

In this manner, the Content Acquirer “crawls” through the origin server.
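The crawl loop and its stopping conditions can be sketched in Python. This is a minimal illustration, not the CDS implementation: the function names, the injected fetch callable, and the reading of "depth" (the starting URL is depth 0, and pages at the depth limit are acquired but not parsed) are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects href and src attribute values found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start_url, fetch, max_depth, max_objects, max_bytes):
    """Breadth-first crawl; returns the URLs acquired, in order.

    fetch(url) is a hypothetical callable that returns
    (content_type, body_bytes), or None if the URL cannot be retrieved.
    """
    queue = deque([(start_url, 0)])
    seen = {start_url}
    acquired = []
    total_bytes = 0

    while queue:
        # Stop: maximum number of objects has been acquired.
        if len(acquired) >= max_objects:
            break
        url, depth = queue.popleft()
        result = fetch(url)
        if result is None:
            continue
        content_type, body = result
        # Stop: acquiring this object would exceed the content size limit.
        if total_bytes + len(body) > max_bytes:
            break
        acquired.append(url)
        total_bytes += len(body)

        # Only HTML is parsed for links, and only above the depth limit.
        if not content_type.startswith("text/html") or depth >= max_depth:
            continue
        parser = LinkParser()
        parser.feed(body.decode("utf-8", errors="replace"))
        for link in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, link))
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return acquired
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP client. Like the crawler described above, it only sees links present in the HTML markup, so links generated by scripts are never discovered.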

Note The crawler cannot parse JavaScript or VBScript to get the links, nor does it work with HTTP cookies.

A website that has indexing enabled and the default document feature disabled generates HTML that contains a directory listing whenever a directory URL is given. That HTML contains links to the files in that directory, which makes it easy for the crawler to get a full listing of all the content in that directory. Because the crawler relies on this listing to search the folders rather than on a default HTML file, directory indexing must be enabled and the directory cannot contain index.html, default.html, or home.html files.

In FTP acquisition, the crawler crawls the folder hierarchy rather than parsing the HTML file. Content ingest from an SMB server for crawl jobs works the same way as FTP ingest.
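The difference from HTTP crawling can be illustrated with a short Python sketch that walks a folder tree and lists files without opening any of them. A local directory stands in for the FTP or SMB server here, and the function name and the depth convention (depth 0 means only files directly in the root folder) are assumptions for illustration only.

```python
from pathlib import Path


def crawl_folders(root, max_depth):
    """Return files under root as POSIX-style relative paths.

    Subfolders are followed down to max_depth levels; file contents are
    never read, so links inside HTML files are ignored.
    """
    root = Path(root)
    found = []

    def visit(folder, depth):
        for entry in sorted(folder.iterdir()):
            if entry.is_file():
                found.append(entry.relative_to(root).as_posix())
            elif entry.is_dir() and depth < max_depth:
                visit(entry, depth + 1)

    visit(root, 0)
    return found
```

An HTML file in the tree is listed like any other file; its links play no part in deciding what else is acquired.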

Content Acquirer

The Content Acquirer parses the Manifest file configured for the delivery service and generates the metadata. If the hybrid ingest attributes are not specified, the Content Acquirer ingests the content after generating the metadata. The Content Acquirer can be shared among many delivery services; in other words, the same Service Engine can perform the Content Acquirer role for another delivery service.

SMB Servers

The CDS supports file acquisition from Windows file servers with shared folders and UNIX servers running the SMB protocol. The Content Acquirer first mounts the shared folder. This mount point then acts as the origin server from which the content is fetched. The Content Acquirer fetches the content and stores it locally.

Note With SMB, files greater than two gigabytes cannot be ingested.
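The mount-then-fetch flow and the size restriction can be sketched as follows. This is only an illustration under stated assumptions: the share is taken to be mounted already (mounting is platform specific and not shown), the function and constant names are hypothetical, and "two gigabytes" is read as 2 GiB.

```python
import os
import shutil

# Assumption: "two gigabytes" is interpreted as 2 GiB (2 * 1024**3 bytes).
SMB_SIZE_LIMIT = 2 * 1024**3


def ingest_from_mount(mount_point, store_dir, size_limit=SMB_SIZE_LIMIT):
    """Copy files from an already-mounted share into local storage.

    Files larger than size_limit are skipped. Returns two sorted lists
    of paths relative to mount_point: (ingested, skipped).
    store_dir must be outside mount_point.
    """
    ingested, skipped = [], []
    for folder, _subdirs, files in os.walk(mount_point):
        for name in files:
            source = os.path.join(folder, name)
            relative = os.path.relpath(source, mount_point)
            if os.path.getsize(source) > size_limit:
                skipped.append(relative)
                continue
            target = os.path.join(store_dir, relative)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copy2(source, target)
            ingested.append(relative)
    return sorted(ingested), sorted(skipped)
```

Checking the size before copying means an oversized file is reported as skipped rather than failing partway through a transfer.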