Google Search Appliance: Getting the Most from Your Google Search Appliance Crawling and Indexing
• Product documentation
• Marketing literature
The Google Search Appliance can crawl many file formats, including word processing,
spreadsheet, and presentation formats.
The Google Search Appliance crawls content on web sites or file systems according to crawl patterns
that you specify by using the Admin Console. As the search appliance crawls public content sources, it
indexes documents that it finds. To find more documents, the crawler follows links within the
documents that it indexes. The search appliance does not crawl content that you choose to exclude
from the index.
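The crawl behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the search appliance's implementation: the in-memory SITE map, the example.com URLs, the crawl and allowed functions, and the simple prefix matching are all hypothetical stand-ins for real content servers and for the crawl patterns that you enter in the Admin Console.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the URLs its page links to.
SITE = {
    "http://intranet.example.com/": [
        "http://intranet.example.com/docs/a.html",
        "http://intranet.example.com/private/b.html",
    ],
    "http://intranet.example.com/docs/a.html": [
        "http://intranet.example.com/docs/c.html",
    ],
    "http://intranet.example.com/private/b.html": [],
    "http://intranet.example.com/docs/c.html": [],
    # No page links to this URL, so a link-following crawl never finds it.
    "http://intranet.example.com/docs/orphan.html": [],
}


def allowed(url, follow_prefixes, exclude_prefixes):
    """Return True if the URL matches a follow prefix and no exclude prefix.

    Assumption: plain prefix matching stands in for real crawl patterns.
    """
    if any(url.startswith(p) for p in exclude_prefixes):
        return False
    return any(url.startswith(p) for p in follow_prefixes)


def crawl(start_urls, follow_prefixes, exclude_prefixes, fetch_links):
    """Breadth-first crawl: index each allowed page, then follow its links."""
    indexed = []
    seen = set(start_urls)
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if not allowed(url, follow_prefixes, exclude_prefixes):
            continue  # excluded content is neither indexed nor followed
        indexed.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed


indexed = crawl(
    start_urls=["http://intranet.example.com/"],
    follow_prefixes=["http://intranet.example.com/"],
    exclude_prefixes=["http://intranet.example.com/private/"],
    fetch_links=lambda url: SITE.get(url, []),
)
print(indexed)
```

In this sketch the page under /private/ is skipped because it matches an exclude prefix, and orphan.html is never reached because no indexed document links to it.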
The following figure provides an overview of crawling public content.
What Content Is Not Crawled?
The Google Search Appliance does not crawl unlinked URLs or links that are embedded within an area
tag. Also, the search appliance does not crawl or index content that is excluded by these mechanisms:
• URLs that you designate as do not follow and crawl on the Crawl and Index > Crawl URLs page in
the Admin Console
• robots.txt file: The Google Search Appliance always obeys the rules in robots.txt (see “Content
Prohibited by a robots.txt File” in Administering Crawl), and this behavior cannot be overridden.
Before the search appliance crawls any content servers in your environment, check with the content
server administrator or webmaster to ensure that robots.txt allows the search appliance user agent
access to the appropriate content (see “Identifying the User Agent” in Administering Crawl).
• nofollow robots META tags that appear in content sources
Typically, webmasters, content owners, and search appliance administrators create robots.txt files and
add META tags to documents before a search appliance starts crawling.
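Before a crawl starts, it can help to confirm what a robots.txt file and a page's robots META tag would permit. The following is a minimal sketch that uses only the Python standard library. The user agent token gsa-crawler, the sample robots.txt rules, the example.com URLs, and the helper names robots_allows and has_nofollow are assumptions for illustration; confirm the actual user agent string in “Identifying the User Agent” in Administering Crawl.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Assumption: the token your appliance sends; check your own configuration.
USER_AGENT = "gsa-crawler"

# Hypothetical robots.txt content for a content server.
ROBOTS_TXT = """\
User-agent: gsa-crawler
Disallow: /drafts/

User-agent: *
Disallow: /
"""


def robots_allows(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def has_nofollow(html):
    """Return True if the document carries a nofollow robots META directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "nofollow" in parser.directives


docs_ok = robots_allows(ROBOTS_TXT, USER_AGENT, "http://www.example.com/docs/guide.html")
drafts_ok = robots_allows(ROBOTS_TXT, USER_AGENT, "http://www.example.com/drafts/plan.html")
page_nofollow = has_nofollow('<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>')
print(docs_ok, drafts_ok, page_nofollow)
```

A check like this is only a pre-crawl aid for the webmaster or administrator; the search appliance applies its own handling of robots.txt and META tags during the crawl.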