Google Search Appliance Feeds Protocol Developer’s Guide Google Search Appliance software version 7.
Google, Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043 www.google.com GSA-FEEDS_100.09 December 2013 © Copyright 2013 Google, Inc. All rights reserved. Google and the Google logo are registered trademarks or service marks of Google, Inc. All other trademarks are the property of their respective owners. Use of any Google solution is governed by the license agreement included in your original contract.
Contents

Feeds Protocol Developer’s Guide

Troubleshooting
    Error Messages on the Feeds Status Page
    Feed Push is Not Successful
    Fed Documents Aren’t Appearing in Search Results
    Document Feeds Successfully But Then Fails
    Fed Documents Aren’t Updated or Removed as Specified in the Feed XML
    Document Status is Stuck “In Progress”
    Insufficient Disk Space Rejects Feeds
    Feed Client TCP Error

Example Feeds
    Web Feed
    Web Feed with Metadata
    Web Feed with Base64 Encoded Metadata
    Full Content Feed
    Incremental Content Feed
    Python Implementation of Creating a base64 Encoded Content Feed
Feeds Protocol Developer’s Guide

This document is for developers who use the Google Search Appliance Feeds Protocol to develop custom feed clients that push content and metadata to the search appliance for processing, indexing, and serving as search results. To push content to the search appliance, you need a feed and a feed client:

• The feed is an XML document that tells the search appliance about the content that you want to push.
• The feed client is the application or web form that pushes the feed to the feedergate server on the search appliance.
The search appliance does not support indexing compressed files sent in content feeds.

The search appliance follows links from a content-fed document, as long as the links match URL patterns added under Follow and Crawl Only URLs with the Following Patterns on the Content Sources > Web Crawl > Start and Block URLs page in the Admin Console.

Web feeds and content feeds behave differently when deleting content.
Quickstart

Here are steps for pushing a content feed to the search appliance.

1. Download sample_feed.xml to your local computer. This is a content feed for a document entitled “Fed Document”.

2. In the Admin Console, go to Content Sources > Web Crawl > Start and Block URLs and add this pattern to “Follow and Crawl Only URLs with the Following Patterns”:

   http://www.localhost.example.com/

   This is the URL for the document defined in sample_feed.xml.

3. Download pushfeed_client.py to your local computer (a sample invocation follows these steps).
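With the script downloaded, a push might look like the following sketch; verify the flag names against your copy of pushfeed_client.py, and replace APPLIANCE-HOSTNAME with the fully qualified domain name of your search appliance:

python pushfeed_client.py --datasource="sample" --feedtype="full" \
    --url="http://APPLIANCE-HOSTNAME:19900/xmlfeed" --xmlfilename="sample_feed.xml"

If the push succeeds, the feedergate server returns a success message.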
Choosing a Name for the Feed Data Source

When you push a feed to the search appliance, the system associates the fed URLs with a data source name, specified by the datasource element in the feed DTD.

• If the data source name is “web”, the system treats the feed as a web feed. A search appliance can have only one data source called “web”.
• If the data source name is anything else, and the feed type is metadata-and-url, the system treats the feed as a web feed.
• If the data source name is anything else, and the feed type is full or incremental, the system treats the feed as a content feed.
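The data source name and the feed type both appear in the feed’s header element, as in this minimal sketch (the name "sample" is an assumption):

<gsafeed>
  <header>
    <datasource>sample</datasource>
    <feedtype>full</feedtype>
  </header>
  ...
</gsafeed>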
To ensure that the search appliance does not crawl a previously fed document, use googleoff/googleon tags (see “Excluding Unwanted Text from the Index” in Administering Crawl) or robots.txt (see “Using robots.txt to Control Access to a Content Server” in Administering Crawl). To update the document, you need to feed the updated document to the search appliance. Documents fed with web feeds, including metadata-and-url feeds, are recrawled periodically, based on the crawl settings for the search appliance.
• displayurl—The URL that should be provided in search results for a document. This attribute is useful for web-enabled content systems where a user expects to obtain a URL with full navigation context and other application-specific data, but where a page does not give the search appliance easy access to the indexable content.

• action—Set action to add when you want the feed to overwrite and update the contents of a URL. If you don’t specify an action, the system performs an add.
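Both attributes appear on the record element, as in this sketch (both URLs are hypothetical):

<record url="http://backend.example.com/docs/1234"
        displayurl="http://portal.example.com/view?doc=1234"
        action="add" mimetype="text/html">
  ...
</record>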
Grouping Records Together

Record elements must be contained inside the group element. The group element also allows you to apply an action to many records at once. For example, this (the URLs are illustrative):

<group action="delete">
  <record url="http://www.example.com/page1.html" mimetype="text/html"/>
  <record url="http://www.example.com/page2.html" mimetype="text/html"/>
</group>

Is equivalent to this:

<group>
  <record url="http://www.example.com/page1.html" mimetype="text/html" action="delete"/>
  <record url="http://www.example.com/page2.html" mimetype="text/html" action="delete"/>
</group>

Here is a record definition that includes base64 encoded content (Zm9vIGJhcgo= encodes the text “foo bar”; the URL is illustrative):

<record url="http://www.example.com/foo.txt" mimetype="text/plain">
  <content encoding="base64binary">Zm9vIGJhcgo=</content>
</record>

Because base64 encoding increases the document size by about one third, it is often more efficient to include non-text documents as URLs in a web feed. Only content that is embedded in the XML feed must be encoded; this restriction does not apply to content that is crawled.

Content Compression

Starting in Google Search Appliance version 6.
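Compressed content is compressed and then base64 encoded, and the content element is marked with encoding="base64compressed". The following is a minimal sketch of producing such a value, assuming zlib as the compression algorithm and an arbitrary input file name:

import base64
import zlib

data = open('large_document.txt', 'rb').read()  # hypothetical input file
encoded = base64.b64encode(zlib.compress(data))
# Embed the result as:
#   <content encoding="base64compressed">...</content>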
In version 6.2 and later, content feeds support the update of both content and metadata, and a content feed can be updated by sending new metadata alone. Note: The content= attribute of a meta element cannot be an empty string (""); for more information, see “Document Feeds Successfully But Then Fails”. If the metadata is part of a feed, it must have the following format:
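The meta names and values in this sketch are illustrative:

<record url="http://www.example.com/doc1.html" mimetype="text/html">
  <metadata>
    <meta name="author" content="J. Smith"/>
    <meta name="department" content="engineering"/>
  </metadata>
</record>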
The authmethod attribute for the record defines the type of authentication. By default, authmethod is set to “none”. To enable secure search from a feed, set the authmethod attribute for the record to ntlm, httpbasic, or httpsso. For example, to enable authentication for protected files on localhost.example.com via Forms Authentication, you would define the record along these lines (the exact URL and MIME type are illustrative):

<record url="http://www.localhost.example.com/secured/test1.html" mimetype="text/html" authmethod="httpsso">

Per-URL ACLs and ACL Inheritance

A per-URL ACL (access control list) has only a single URL associated with it. You can use feeds to add per-URL ACLs to the search appliance index. To specify a per-URL ACL, use the acl element, as described in “Specifying Per-URL ACLs”. ACL information can be applied to groups of documents through inheritance. To specify ACL inheritance, use the attributes described in “Specifying ACL Inheritance”.
principal Element

To specify the principal, its name, and access to a document, use the principal element. The principal element is a child of the acl element.
principal-type Attribute

The principal-type attribute indicates that the domain string attached to the principal will not be transformed internally by the search appliance. The only valid value is “unqualified”. This attribute supports SharePoint local groups.

Specifying ACL Inheritance

While ACLs can be attached directly to documents, content systems also allow ACL information to be applied to groups of documents through inheritance.
Approaches to Using the acl Element

There are two approaches to using the acl element:

• As the child of a group element
• As the child of a record element

If the acl element is the child of a group element, the url attribute is required. An acl element as the child of a group element can be used in the following scenarios:

• Updating the ACL of a record (independently of content) for a URL that was previously fed with attached ACLs.

A sketch of this form follows.
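The following sketch shows an acl element standing alone in a group, so the url attribute is present. The URL, principal names, and inheritance values are illustrative, and the lowercase attribute values are assumptions; check the gsafeed.dtd on your appliance for the exact permitted values:

<group>
  <acl url="http://www.example.com/secured/doc1.html"
       inheritance-type="parent-overrides"
       inherit-from="http://www.example.com/secured/">
    <principal scope="user" access="permit">jsmith</principal>
    <principal scope="group" access="deny">contractors</principal>
  </acl>
</group>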
Legacy Metadata Format (Deprecated)

For compatibility with feeds developed before software release 7.0, the search appliance supports the legacy metadata format for specifying per-URL ACLs in feeds. The legacy approach is limited: it does not support namespaces or case sensitivity. However, the following meta names enable you to specify ACL inheritance in metadata format:

• google:aclinheritfrom
• google:aclinheritancetype

The valid value for google:aclinheritfrom is a URL string.
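In the legacy format, inheritance is expressed with meta tags like the following sketch (the URL and the inheritance type value are illustrative):

<meta name="google:aclinheritfrom" content="http://www.example.com/secured/"/>
<meta name="google:aclinheritancetype" content="parent-overrides"/>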
Specifying Denial of Access to Users and Groups

The search appliance supports DENY ACLs. When a user or group is denied permission to view a URL, that URL does not appear in the user’s search results. You can specify users and groups that are not permitted to view a document by using meta tags, as shown in the following examples. To specify denial of access, the value of the name attribute must be google:acldenyusers or google:acldenygroups.
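For example (the user and group names are hypothetical):

<meta name="google:acldenyusers" content="jblack"/>
<meta name="google:acldenygroups" content="contractors"/>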
Feeding Groups to the Search Appliance

The search appliance can experience increased latency when establishing a user’s identity and the groups that the user belongs to. You can dramatically reduce the latency of group resolution by periodically feeding groups information to the search appliance. When the groups information is on the search appliance, it is available in the security manager for resolving groups at authentication time. Consequently, the information works for all authorization mechanisms.
A principal element can have the following attributes:

• scope
• namespace
• case-sensitivity-type
• principal-type

scope Attribute

The scope attribute specifies the type of the principal. Valid values are:

• USER
• GROUP

The scope attribute is required.

namespace Attribute

By keeping principals in separate namespaces, the search appliance is able to ensure that access to secure documents is maintained unambiguously. Namespaces are crucial to security when a search user has multiple identities.
Example Feed with Groups

The following sketch shows an example of a groups feed XML file, in which a group named abc has a single member.
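The group name abc is from the original example; the member name user1 is hypothetical:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xmlgroups>
  <membership>
    <principal namespace="Default" case-sensitivity-type="EVERYTHING_CASE_SENSITIVE" scope="GROUP">abc</principal>
    <members>
      <principal namespace="Default" case-sensitivity-type="EVERYTHING_CASE_SENSITIVE" scope="USER">user1</principal>
    </members>
  </membership>
</xmlgroups>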
Groups Feed Document Type Definition
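A groups feed follows a structure along the lines of this DTD sketch, reconstructed from the element descriptions above; consult your search appliance for the authoritative definition:

<!ELEMENT xmlgroups (membership*)>
<!ELEMENT membership (principal, members)>
<!ELEMENT members (principal*)>
<!ELEMENT principal (#PCDATA)>
<!ATTLIST principal
    scope (USER|GROUP) #REQUIRED
    namespace CDATA "Default"
    case-sensitivity-type (EVERYTHING_CASE_SENSITIVE|EVERYTHING_CASE_INSENSITIVE) "EVERYTHING_CASE_SENSITIVE"
    principal-type CDATA #IMPLIED>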
The following example shows a groups feed client written in Python.

# Copyright 2013 Google Inc. All Rights Reserved.

"""A helper script that pushes a groups xml file to the feeder."""

import getopt
import mimetypes
import sys
import urllib2


def PrintMessage():
  """Print help message for command usage."""
  print """Usage: %s ARGS
  --groupsource: sharepoint, ggg, ldap or others
  --url: groupsfeed url of the feedergate, e.g.
         http://APPLIANCE-HOSTNAME:19900/xmlgroups
  """ % sys.argv[0]
def PostMultipart(theurl, fields, files):
  """Create the POST request by encoding data and adding headers."""
  content_type, body = EncodeMultipartFormdata(fields, files)
  headers = {}
  headers["Content-type"] = content_type
  headers["Content-length"] = str(len(body))
  return urllib2.Request(theurl, body, headers)


def EncodeMultipartFormdata(fields, files):
  """Create data in multipart/form-data encoding."""
  boundary = "----------boundary_of_feed_data$"
  crlf = "\r\n"
  l = []
  for (key, value) in fields:
    # One part per ordinary form field.
    l.append("--" + boundary)
    l.append('Content-Disposition: form-data; name="%s"' % key)
    l.append("")
    l.append(value)
  for (key, filename, value) in files:
    # The groups XML file is sent as a file part.
    l.append("--" + boundary)
    l.append('Content-Disposition: form-data; name="%s"; filename="%s"'
             % (key, filename))
    l.append("Content-Type: %s"
             % (mimetypes.guess_type(filename)[0] or "application/octet-stream"))
    l.append("")
    l.append(value)
  l.append("--" + boundary + "--")
  l.append("")
  body = crlf.join(l)
  content_type = "multipart/form-data; boundary=%s" % boundary
  return content_type, body
Feeding Content from a Database

To push records from a database into the search appliance’s index, you use a special content feed that is generated by the search appliance based on parameters that you set in the Admin Console. To set up a feed for database content, log into the Admin Console and choose Content Sources > Databases.
Designing a Feed Client

You upload an XML feed using an HTTP POST to the feedergate server located on port 19900 of your search appliance. The search appliance also supports HTTPS access to the feedergate server through port 19902, enabling you to upload an XML feed file by using a secure connection. An XML feed must be less than 1 GB in size. If your feed is larger than 1 GB, consider breaking the feed into smaller feeds that can be pushed more efficiently.
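A simple HTML form is enough to push a feed from a browser. The following sketch assumes the form field names datasource, feedtype, and data, which match the parameters used by the sample feed client; treat the surrounding markup as illustrative:

<html><body>
<form enctype="multipart/form-data" method="POST"
      action="http://APPLIANCE-HOSTNAME:19900/xmlfeed">
  Data source name: <input type="text" name="datasource"><br>
  Feed type: <input type="text" name="feedtype"><br>
  XML file: <input type="file" name="data"><br>
  <input type="submit" value="Submit feed">
</form>
</body></html>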
To adapt this form for your search appliance, replace APPLIANCE-HOSTNAME with the fully qualified domain name of your search appliance.
The success message indicates that the feedergate process has received the XML file successfully. It does not mean that the content will be added to the index, as this is handled asynchronously by a separate process known as the “feeder”. The data source will appear in the Feeds page in the Admin Console after the feeder process runs. The feeder does not provide automatic notification of a feed error.
For content feeds, the content is provided as part of the XML and does not need to be fetched by the crawler. URLs are passed to the server that maintains Crawl Diagnostics in the Admin Console. This will happen within 15 minutes if your system is not busy. The feeder also passes the URLs and their contents to the indexing process. The URLs will appear in your search results within 30 minutes if your system is not busy.
Feed Files Awaiting Processing

To view a count of how many feed files remain for the search appliance to process into its index, add /getbacklogcount to a search appliance URL at port 19900. The count that this feature provides can be used to regulate the feed submission rate. The count also includes connector feed files.
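For example, you can poll the count with a plain HTTP GET (APPLIANCE-HOSTNAME is a placeholder for your appliance’s fully qualified domain name):

curl http://APPLIANCE-HOSTNAME:19900/getbacklogcount

A feed client might wait until the returned count drops below a chosen threshold before pushing its next feed file.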
Troubleshooting

Here are some things to check if a URL from your feed does not appear in the index. To see a list of known and fixed issues, see the latest release notes for each version.

Error Messages on the Feeds Status Page

If the feeds status page shows “Failed in error”, you can click the link to view the log file.

ProcessFeed: parsing error

This message means that the XML in your feed file could not be parsed.
Fed Documents Aren’t Appearing in Search Results

Some common reasons why the URLs in your feed might not be found in your search results include:

1. The crawler is still running. Wait a few hours and search again. For large document feeds containing multiple non-text documents, the search appliance can take several minutes to process all of the documents. You can check the status of a document feed by going to the Content Sources > Feeds page.
Document Feeds Successfully But Then Fails

A content feed reports success at the feedergate, but thereafter reports the following document feed error:

Failed in error
documents included: 0
documents in error: 1
error details: Skipping the record, Line number: nn, Error: Element record content does not follow the DTD, Misplaced metadata

This error occurs when a metadata element contains a content attribute with an empty string, for example:

<meta name="author" content=""/>

If the content attribute value is an empty string, remove the meta element from the record or supply a non-empty value before pushing the feed.
Document Status is Stuck “In Progress”

If a document feed gives a status of “In Progress” for more than one hour, this could mean that an internal error has occurred. Please contact Google to resolve this problem, or you can reset your index by going to Administration > Reset Index.

Insufficient Disk Space Rejects Feeds

If there is insufficient free disk space, the search appliance rejects feeds, and displays the following message in the feed response:

Feed not accepted due to insufficient disk space.
Example Feeds

This section lists example feeds:

• Web Feed
• Web Feed with Metadata
• Web Feed with Base64 Encoded Metadata (data source example3, feed type metadata-and-url)
• Full Content Feed
• Incremental Content Feed

Sketches of a web feed with metadata and a full content feed follow this list; an incremental content feed sketch appears in the next section.
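The following sketches show the general shape of these feeds; the URLs, metadata names, and values are illustrative, while the data source name example3 with feed type metadata-and-url is from the original example. First, a web feed with metadata:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
  <header>
    <datasource>example3</datasource>
    <feedtype>metadata-and-url</feedtype>
  </header>
  <group>
    <record url="http://www.example.com/page1.html" mimetype="text/html">
      <metadata>
        <meta name="author" content="J. Smith"/>
      </metadata>
    </record>
  </group>
</gsafeed>

Second, a full content feed, in which the document content travels inside the record:

<gsafeed>
  <header>
    <datasource>sample</datasource>
    <feedtype>full</feedtype>
  </header>
  <group>
    <record url="http://www.example.com/hello01" mimetype="text/plain">
      <content>This is hello01</content>
    </record>
  </group>
</gsafeed>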
Incremental Content Feed

An incremental content feed updates previously fed documents. The following sketch updates the content of the document hello02; the content string “UPDATED - This is hello02” is from the original example, while the URL and data source name are illustrative:

<gsafeed>
  <header>
    <datasource>sample</datasource>
    <feedtype>incremental</feedtype>
  </header>
  <group>
    <record url="http://www.example.com/hello02" mimetype="text/plain">
      <content>UPDATED - This is hello02</content>
    </record>
  </group>
</gsafeed>

Python Implementation of Creating a base64 Encoded Content Feed

The following create_base64_content_feeds.py script goes through all PDF files under MY_DIR and creates a content feed record for each of them, writing the result to the base64_pdfs.xml file. This file can then be used to add the documents that are under MY_DIR to the index. The body of the loop is completed here as a sketch: the data source name pdfs and the URL prefix are assumptions.

import base64
import os

MY_DIR = '/var/www/files/'
MY_FILE = 'base64_pdfs.xml'

def main():
  files = os.listdir(MY_DIR)
  if os.path.exists(MY_FILE):
    os.remove(MY_FILE)  # start from a clean output file
  out = open(MY_FILE, 'w')
  out.write('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<gsafeed><header><datasource>pdfs</datasource>'
            '<feedtype>full</feedtype></header><group>\n')
  for name in files:
    if not name.endswith('.pdf'):
      continue
    # Read each PDF and embed it as base64 encoded content.
    data = base64.encodestring(open(os.path.join(MY_DIR, name), 'rb').read())
    out.write('<record url="http://www.example.com/files/%s" '
              'mimetype="application/pdf">'
              '<content encoding="base64binary">%s</content></record>\n'
              % (name, data))
  out.write('</group></gsafeed>\n')
  out.close()

if __name__ == '__main__':
  main()
Google Search Appliance Feed DTD

You can view the gsafeed.dtd file on your search appliance by browsing to http://APPLIANCE-HOSTNAME:7800/gsafeed.dtd, where APPLIANCE-HOSTNAME is the fully qualified domain name of your search appliance.