Google Search Appliance External Metadata Indexing Guide Google Search Appliance software version 7.
Google, Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043 www.google.com GSA-META_100.03 December 2013 © Copyright 2013 Google, Inc. All rights reserved. Google and the Google logo are registered trademarks or service marks of Google, Inc. All other trademarks are the property of their respective owners. Use of any Google solution is governed by the license agreement included in your original contract.
External Metadata Indexing Guide

This guide is for developers and administrators of the Google Search Appliance who have documents with metadata that is not stored directly in the primary document. A primary document is a record, file, or web page that the search appliance treats as a document to index or serve. The guide explains how to use the external metadata indexing capabilities of the search appliance, either through the use of the Feeds system or the Database Crawler.
When you index external metadata, it is searchable in the same way that other metadata is searchable. For example, you can use the partialfields and requiredfields query operators to search for documents with particular metadata. For more information about metadata queries and query operators, see the Search Protocol Reference.
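As a rough illustration of a metadata query, the sketch below builds a search URL that carries a requiredfields or partialfields restriction. The host name, the `site` and `client` defaults, the `output` value, and the `name:value` form of the restriction are assumptions made for this example; check them against the Search Protocol Reference before relying on them.

```python
from urllib.parse import urlencode


def metadata_search_url(host, query, requiredfields=None, partialfields=None,
                        site="default_collection", client="default_frontend"):
    """Build a search URL restricted by metadata (hypothetical helper).

    The parameter values used here are assumptions for illustration only.
    """
    params = {"q": query, "site": site, "client": client, "output": "xml_no_dtd"}
    if requiredfields:
        # Restrict results to documents whose metadata matches exactly.
        params["requiredfields"] = requiredfields
    if partialfields:
        # Restrict results to documents whose metadata contains the value.
        params["partialfields"] = partialfields
    return f"http://{host}/search?{urlencode(params)}"


# Example: documents matching "budget" whose "department" metadata is "engineering".
url = metadata_search_url("gsa.example.com", "budget",
                          requiredfields="department:engineering")
```

Externally indexed metadata would be queried the same way as metadata embedded in the document itself, so nothing in the URL distinguishes the two.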
External Metadata Stored in a Database

There are three scenarios for indexing external metadata that is stored in a database, depending on how your primary document is referenced and stored. For each of these scenarios, the search appliance indexes a meta name for each field in your crawl query and meta content for the value in that field. If you want to use an alias for a field name, you can use the SQL keyword AS in the crawl query to give the field a more meaningful name.
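The effect of an AS alias on the resulting meta names can be previewed locally. The following sketch uses an in-memory SQLite database as a stand-in for the real database; the table, its column names, and the `rows_as_metadata` helper are hypothetical and exist only for this illustration. It shows how each field in a crawl query maps to a name, and each value to the content for that name.

```python
import sqlite3


def rows_as_metadata(conn, crawl_query):
    """Return one {meta name: meta content} dict per row of the crawl query."""
    cur = conn.execute(crawl_query)
    # cursor.description holds the result column names, including AS aliases.
    names = [col[0] for col in cur.description]
    return [dict(zip(names, row)) for row in cur.fetchall()]


# Hypothetical table with terse column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_url TEXT, auth_nm TEXT, dept_cd TEXT)")
conn.execute("INSERT INTO docs VALUES (?, ?, ?)",
             ("http://example.com/a.pdf", "J. Smith", "ENG"))

# AS gives each field a more meaningful name in the result set.
crawl_query = ("SELECT doc_url, auth_nm AS author, dept_cd AS department "
               "FROM docs")
records = rows_as_metadata(conn, crawl_query)
```

With the aliases in place, the row is keyed by `author` and `department` rather than by the original column names `auth_nm` and `dept_cd`.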
In this scenario, the search appliance queries the database for data, then submits a feed with the resulting rows. The search appliance crawls and indexes the set of records that is defined by the crawl query. The URLs extracted from each external metadata record (as defined by the URL field) are added to the crawl queue and crawled by either the web or file system crawler, following the normal crawl policy.
In this scenario, the search appliance queries the database for data, then submits a feed with the resulting rows. The search appliance extracts and indexes the recordset that is defined by the crawl query. The URLs constructed from each external metadata record (as defined by the Document ID Field and the Base URL field) are added to the crawl queue and crawled by either the web or file system crawler, following the normal crawl policy.
External Metadata Pushed in a Feed

The remaining scenarios use feeds. Feeds work well when the external metadata is not stored in a relational database, the primary document is not accessible by the search appliance’s crawlers, or the reference between the external metadata and the primary document is not easily expressed. You can use a feeds-based solution in any of these cases or any case where you prefer using feeds to implementing the database scenarios.
2. Create a <record> element for each primary document. In the <record> element, insert one or more <meta> elements.

Scenario 5
Metadata: Inserted into the feed XML file.
Primary Document: Referenced by the URL in the feed XML file (web feed).

This scenario is similar to the previous scenario, except that the primary document is referenced by URL only (instead of the contents of the primary document being fed to the search appliance). The feed file therefore contains the <header> information and, for each <record> element, the URL of the record and the <meta> elements.
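A web feed of this kind, with one record per primary document referenced by URL and the metadata carried alongside it, might be generated as in the following sketch. The element and attribute names (`gsafeed`, `header`, `datasource`, `feedtype`, `group`, `record`, `metadata`, `meta`), the `metadata-and-url` feed type, and the `mimetype` value are written from recollection of the feeds XML format and are assumptions here; verify them against the Feeds Protocol Developer's Guide. The data source name and URLs are placeholders.

```python
import xml.etree.ElementTree as ET


def build_web_feed(datasource, records, mimetype="text/html"):
    """Build feed XML that references each document by URL only.

    records: iterable of (url, {meta name: meta content}) pairs.
    Element names and the feed type are assumptions, not confirmed here.
    """
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(root, "group")
    for url, meta in records:
        # One record per primary document; no document content is included.
        record = ET.SubElement(group, "record", url=url, mimetype=mimetype)
        metadata = ET.SubElement(record, "metadata")
        for name, content in meta.items():
            ET.SubElement(metadata, "meta", name=name, content=content)
    return ET.tostring(root, encoding="unicode")


feed_xml = build_web_feed(
    "example_datasource",
    [("http://www.example.com/report.html",
      {"author": "J. Smith", "department": "Engineering"})],
)
```

The sketch only produces the XML body. A real feed file would also need the XML declaration and DOCTYPE that the feeds format requires, and would then be pushed to the search appliance.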
Each value has the form: meta-name=meta-value. Both the meta-name and the meta-value are encoded according to section 2 of RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt), commonly known as percent-encoding. The search appliance does not transform ‘+’ to space. The following restrictions apply to this header:
• The meta-value cannot be empty.
• The values for the meta-name and meta-value cannot contain embedded quotation marks.
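The encoding and the two restrictions can be sketched as a small helper. The function name is hypothetical, and treating "quotation marks" as the double-quote character is an assumption. Because the search appliance does not turn ‘+’ into a space, the sketch uses `urllib.parse.quote` (which encodes a space as `%20`) rather than `quote_plus`.

```python
from urllib.parse import quote


def encode_metadata_value(name, value):
    """Return 'meta-name=meta-value' with both parts percent-encoded."""
    if value == "":
        raise ValueError("meta-value cannot be empty")
    # Assumption: "embedded quotation marks" means the double-quote character.
    if '"' in name or '"' in value:
        raise ValueError("meta-name and meta-value cannot contain quotation marks")
    # safe="" percent-encodes every reserved character, including '/', '+', and '='.
    return f"{quote(name, safe='')}={quote(value, safe='')}"


encoded = encode_metadata_value("author", "Jane Doe+Co")
# encoded == "author=Jane%20Doe%2BCo"
```

Encoding a literal ‘+’ as `%2B` keeps it distinct from a space, which matters because the receiving side will not reinterpret ‘+’.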