A content connector is a software program that traverses data in an enterprise repository and populates a data source. Google provides the following options for developing content connectors:
- The Content Connector SDK. This is a good option for Java programmers. The SDK is a wrapper around the REST API that lets you quickly create connectors. To create a content connector using the SDK, see Create a content connector using the Content Connector SDK.
- A low-level REST API or API libraries. Use these options if you don't use Java or if your codebase better accommodates a REST API or a library. To create a content connector using the REST API, see Create a content connector using the REST API.
A typical content connector performs the following tasks:
- Reads and processes configuration parameters.
- Pulls discrete chunks of indexable data, called "items," from the third-party repository.
- Combines ACLs, metadata, and content data into indexable items.
- Indexes items to the Cloud Search data source.
- (Optional) Listens for change notifications from the repository. The connector converts change notifications into indexing requests to keep the Cloud Search data source in sync with the repository. The connector performs this task only if the repository supports change detection.
Create a content connector using the Content Connector SDK
The following sections explain how to create a content connector using the Content Connector SDK.
Set up dependencies
Include these dependencies in your build file.
Maven
<dependency>
<groupId>com.google.enterprise.cloudsearch</groupId>
<artifactId>google-cloudsearch-indexing-connector-sdk</artifactId>
<version>v1-0.0.3</version>
</dependency>
Gradle
implementation group: 'com.google.enterprise.cloudsearch',
name: 'google-cloudsearch-indexing-connector-sdk',
version: 'v1-0.0.3'
Create your connector configuration
Every connector uses a configuration file for parameters like your repository ID.
Define parameters as key-value pairs, such as api.sourceId=1234567890abcdef.
The Google Cloud Search SDK includes Google-supplied parameters for all connectors. You must declare the following in your configuration file:
- Content connector: Declare api.sourceId and api.serviceAccountPrivateKeyFile. These identify your repository and the private key needed for access.
- Identity connector: Declare api.identitySourceId to identify your external identity source. For user syncing, also declare api.customerId (the unique ID for your Google Workspace account).
Declare other Google-supplied parameters only to override their default values. For details on generating IDs and keys, see Google-supplied parameters.
You can also define repository-specific parameters in your configuration file.
Pass the configuration file to the connector
Set the config system property to pass the configuration file. Use the -D
argument when starting the connector. For example:
java -classpath myconnector.jar -Dconfig=MyConfig.properties MyConnector
If you omit this argument, the SDK attempts to use a file named
connector-config.properties in the local directory.
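The lookup order described here can be sketched with the standard library alone. This is a minimal sketch, not the SDK's own loader: the class ConfigLoader and its methods are hypothetical, and only the config property name and the connector-config.properties default come from the text above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper; the SDK performs its own configuration loading.
final class ConfigLoader {
  static final String DEFAULT_FILE = "connector-config.properties";

  // Resolves the file named by -Dconfig=..., falling back to the default name.
  static Path resolveConfigPath() {
    return Paths.get(System.getProperty("config", DEFAULT_FILE));
  }

  // Parses key-value pairs such as api.sourceId=1234567890abcdef.
  static Properties load(Reader reader) {
    Properties props = new Properties();
    try {
      props.load(reader);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }

  public static void main(String[] args) throws IOException {
    Path path = resolveConfigPath();
    if (Files.exists(path)) {
      try (Reader reader = Files.newBufferedReader(path)) {
        System.out.println("Loaded " + load(reader).size() + " parameters from " + path);
      }
    } else {
      System.out.println("No configuration file found at " + path);
    }
  }
}
```

Running this class with -Dconfig=MyConfig.properties reads that file; without the argument it looks for the default file name in the working directory.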
Determine your traversal strategy
The primary function of a content connector is to traverse a repository and index its data. You must implement a strategy based on your repository's size and layout. You can design your own or choose a strategy from the SDK:
- Full traversal strategy: Scans the entire repository and indexes every item on each traversal. Use it for small repositories with mostly static, non-hierarchical data where you can afford the overhead of a full traversal, or when change detection is difficult.
- List traversal strategy: Scans the entire repository to determine the status of each item, then indexes only new or updated items. Use this for incremental updates to a large, non-hierarchical index when change detection isn't supported.
- Graph traversal strategy: Scans a parent node to determine the status of its items, then indexes new or updated items in that node. It then recursively processes child nodes. Use this for hierarchical repositories where listing all IDs isn't practical, such as directory structures or websites.
The SDK implements these strategies in template connector classes. These templates can speed up your development. To use a template, see the corresponding section:
- Create a full traversal connector using a template class
- Create a list traversal connector using a template class
- Create a graph traversal connector using a template class
Create a full traversal connector using a template class
This section refers to code from the FullTraversalSample.
Implement the connector entry point
The entry point is the main() method. It creates an Application instance and calls start() to run the connector.
Before calling application.start(), use the IndexingApplication.Builder class to instantiate the FullTraversalConnector template. This template accepts a Repository object.
The SDK calls initConfig() after your main() method calls Application.build(). The initConfig() method:
- Ensures the Configuration is not already initialized.
- Initializes the Configuration object with Google-supplied key-value pairs.
Implement the Repository interface
The Repository object traverses and indexes repository items. When using a
template, you only need to override certain methods in the Repository
interface. For FullTraversalConnector, override:
- init(): For repository setup and initialization.
- getAllDocs(): To traverse and index all items. This is called once for each scheduled traversal.
- (Optional) getChanges(): If your repository supports change detection, override this to retrieve and index modified items.
- (Optional) close(): For repository cleanup during shutdown.
The traversal methods return ApiOperation objects, which perform indexing using IndexingService.indexItem().
Get custom configuration parameters
To handle your connector's configuration, you must retrieve any custom parameters from the Configuration object. Perform this task in your Repository class's init() method.
The Configuration class includes methods to retrieve different data types. Each method returns a ConfigValue object. Use the ConfigValue object's get() method to retrieve the value. The FullTraversalSample uses this approach to retrieve a custom integer value.
To retrieve and parse parameters with multiple values, use one of the Configuration class's type parsers. The tutorial connector uses getMultiValue to retrieve a list of GitHub repository names.
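The two retrieval patterns, a single integer with a default and a multi-value list, can be illustrated without the SDK. The following is a minimal sketch using java.util.Properties in place of the SDK's Configuration class. The class CustomParams, its methods, and the parameter keys are hypothetical, and the comma-separated format is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

// Hypothetical stand-in for reading custom parameters; not the SDK's Configuration class.
final class CustomParams {
  // Returns the integer stored under key, or defaultValue when the key is absent.
  static int getInt(Properties props, String key, int defaultValue) {
    String raw = props.getProperty(key);
    return raw == null ? defaultValue : Integer.parseInt(raw.trim());
  }

  // Splits a comma-separated value into trimmed, non-empty entries.
  static List<String> getList(Properties props, String key) {
    String raw = props.getProperty(key, "");
    return Arrays.stream(raw.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    // Illustrative keys only; use the names your connector defines.
    props.setProperty("sample.documentCount", "25");
    props.setProperty("github.repos", "google/guava, google/gson");
    System.out.println(getInt(props, "sample.documentCount", 10));
    System.out.println(getList(props, "github.repos"));
  }
}
```

Reading every custom parameter once during initialization keeps parsing errors at startup rather than in the middle of a traversal.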
Perform a full traversal
Override getAllDocs() to perform a full traversal. This method accepts a checkpoint so that indexing can resume if the process is interrupted. For each item:
- Set permissions.
- Set metadata.
- Combine them into a RepositoryDoc.
- Package each item into the iterator returned by getAllDocs().
If the item set is too large to process in one call, include a checkpoint and set hasMore(true) to indicate that more items are available.
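The checkpoint-and-hasMore pattern can be modeled in plain Java. This is a minimal sketch under stated assumptions, not SDK code: PagedTraversal and Batch are hypothetical types, the repository is an in-memory list, and the checkpoint is assumed to be an opaque byte array that here simply encodes the next offset.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of checkpointed paging; the SDK's own types are not used here.
final class PagedTraversal {
  // One page of results plus the state needed to resume.
  static final class Batch {
    final List<String> itemIds;
    final byte[] checkpoint;
    final boolean hasMore;

    Batch(List<String> itemIds, byte[] checkpoint, boolean hasMore) {
      this.itemIds = itemIds;
      this.checkpoint = checkpoint;
      this.hasMore = hasMore;
    }
  }

  // Returns up to pageSize IDs starting at the offset encoded in checkpoint (null means start).
  static Batch getAllDocs(List<String> repository, byte[] checkpoint, int pageSize) {
    int start = checkpoint == null
        ? 0
        : Integer.parseInt(new String(checkpoint, StandardCharsets.UTF_8));
    int end = Math.min(start + pageSize, repository.size());
    byte[] next = Integer.toString(end).getBytes(StandardCharsets.UTF_8);
    return new Batch(new ArrayList<>(repository.subList(start, end)), next, end < repository.size());
  }

  public static void main(String[] args) {
    List<String> repository = List.of("doc1", "doc2", "doc3", "doc4", "doc5");
    byte[] checkpoint = null;
    Batch batch;
    do {
      batch = getAllDocs(repository, checkpoint, 2);
      System.out.println(batch.itemIds + " hasMore=" + batch.hasMore);
      checkpoint = batch.checkpoint;
    } while (batch.hasMore);
  }
}
```

Because each batch carries its own checkpoint, a traversal that stops after any batch can resume from the last checkpoint instead of starting over.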
Set the permissions for an item
Repositories use Access Control Lists (ACLs) to identify users or groups with access to an item. An ACL lists the IDs of authorized users or groups.
To ensure users only see search results they are authorized to access, you must replicate your repository's ACLs. Include the ACL when indexing an item so Google Cloud Search can provide the correct access level.
The Content Connector SDK includes classes and methods to model the ACLs of most repositories. Analyze your repository's ACLs and create corresponding ACLs for Cloud Search during indexing. Modeling complex ACLs, such as those using inheritance, requires careful planning. For more information, see Cloud Search ACLs.
Use the Acl.Builder class to set access. The full traversal sample lets all domain users (getCustomerPrincipal()) read all items (setReaders()).
Properly modeling repository ACLs, especially those using inheritance models, requires the information in Cloud Search ACLs.
Set the metadata for an item
Metadata is stored in an Item object. To create an Item, you need a unique ID, item type, ACL, URL, and version. Use the IndexingItemBuilder helper class.
Create the indexable item
Use the RepositoryDoc.Builder class to create the indexable item. A RepositoryDoc is an ApiOperation that performs the IndexingService.indexItem() request.
Use the setRequestMode() method of the RepositoryDoc.Builder class to set the indexing request to ASYNCHRONOUS or SYNCHRONOUS:
- ASYNCHRONOUS: This mode has longer indexing-to-serving latency but accommodates a larger throughput quota. Use asynchronous mode for initial indexing (backfill) of an entire repository.
- SYNCHRONOUS: This mode has shorter indexing-to-serving latency but a smaller throughput quota. Use synchronous mode for indexing repository updates and changes. The request mode defaults to SYNCHRONOUS if unspecified.
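The trade-off between the two modes reduces to a simple rule of thumb, sketched below. The enum and the RequestModeChooser class are hypothetical local stand-ins that only mirror the two mode names from the text. They are not the SDK's types.

```java
// Hypothetical helper that picks a request mode; the enum mirrors the two modes named in the text.
final class RequestModeChooser {
  enum RequestMode { SYNCHRONOUS, ASYNCHRONOUS }

  // Backfills favor throughput; incremental updates favor low indexing-to-serving latency.
  static RequestMode choose(boolean initialBackfill) {
    return initialBackfill ? RequestMode.ASYNCHRONOUS : RequestMode.SYNCHRONOUS;
  }

  public static void main(String[] args) {
    System.out.println("Backfill: " + choose(true));
    System.out.println("Update:   " + choose(false));
  }
}
```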
Package each indexable item in an iterator
The getAllDocs() method returns a CheckpointCloseableIterable of RepositoryDoc objects. Use the CheckpointCloseableIterableImpl.Builder class.
Next steps
- (Optional) If indexing throughput is slow, see Increase indexing rate.
- (Optional) Implement close() to release resources.
- (Optional) Create an identity connector.
Create a list traversal connector using a template class
The Cloud Search Indexing Queue holds IDs and optional hashes for repository items. A list traversal connector pushes IDs to this queue and retrieves them for indexing. Cloud Search maintains these queues to determine item status, such as deletions. See The Cloud Search Indexing Queue.
This section refers to the ListTraversalSample.
Implement the connector entry point
The main() method creates an Application instance and calls start(). Use IndexingApplication.Builder to instantiate the ListingConnector template.
Implement the Repository interface
Override the following methods for ListingConnector:
- init(): For repository setup.
- getIds(): To retrieve IDs and hashes for all records.
- getDoc(): To add, update, or delete items from the index.
- (Optional) getChanges(): For incremental updates using change detection.
- (Optional) close(): For repository cleanup.
Perform the list traversal
Override getIds() to retrieve IDs and hashes. Override getDoc() to handle
each item in the Cloud Search Indexing Queue.
Push item IDs and hash values
Override getIds() to fetch IDs and content hashes. Package them into a PushItems request to the Indexing Queue. Use PushItems.Builder to package the IDs and hashes.
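To make the ID-and-hash pairing concrete, here is a minimal sketch that computes a content hash for each item and collects the pairs. It does not use PushItems.Builder. The IdHashes class is hypothetical, the repository is modeled as an in-memory map of ID to content, and SHA-256 is an assumed choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of pairing item IDs with content hashes; PushItems.Builder is not used here.
final class IdHashes {
  // Hex-encoded SHA-256 of the content; any stable hash would serve the same purpose.
  static String contentHash(String content) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest(content.getBytes(StandardCharsets.UTF_8))) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
    }
  }

  // Maps each item ID to the hash of its current content, preserving iteration order.
  static Map<String, String> idsWithHashes(Map<String, String> repository) {
    Map<String, String> result = new LinkedHashMap<>();
    repository.forEach((id, content) -> result.put(id, contentHash(content)));
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> repository = new LinkedHashMap<>();
    repository.put("doc1", "first document");
    repository.put("doc2", "second document");
    idsWithHashes(repository).forEach((id, hash) -> System.out.println(id + " " + hash));
  }
}
```

A hash that changes only when the content changes lets a later step skip items that were pushed again but not modified.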
Retrieve and handle each item
Override getDoc() to handle items in the Indexing Queue. Items can be new, modified, unchanged, or deleted.
- Check if the item ID exists in the repository. If not, delete the item from the index.
- Poll the index for the item's status. If the item is unchanged (ACCEPTED), do nothing.
- Index changed or new items: set permissions, set metadata, combine them into a RepositoryDoc, and return it.
Handle deleted items
In getDoc(), determine whether the item still exists in the repository. If it does not, delete it from the index.
Handle unchanged items
To handle unchanged items, poll the Indexing Queue for each item's status and skip items that have not changed. The sample uses a hash to detect changes.
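The three outcomes for a queued item (delete, skip, or index) can be summarized as one decision function. This is a simplified sketch, not the sample's code: QueuedItemHandler and Action are hypothetical names, the repository is modeled as a map of ID to current content hash, and the check relies on hash comparison alone rather than on an index status.

```java
import java.util.Map;

// Hypothetical decision logic for one queued item; SDK types are not used here.
final class QueuedItemHandler {
  enum Action { DELETE, SKIP, INDEX }

  // repositoryHashes: current ID -> content hash.
  // queuedHash: hash stored with the queue entry, or null if none was pushed.
  static Action decide(String itemId, String queuedHash, Map<String, String> repositoryHashes) {
    String currentHash = repositoryHashes.get(itemId);
    if (currentHash == null) {
      return Action.DELETE; // the item no longer exists in the repository
    }
    if (currentHash.equals(queuedHash)) {
      return Action.SKIP;   // content is unchanged since the last push
    }
    return Action.INDEX;    // the item is new or modified
  }

  public static void main(String[] args) {
    Map<String, String> repository = Map.of("doc1", "hashA", "doc2", "hashB");
    System.out.println(decide("doc1", "hashA", repository));
    System.out.println(decide("doc2", "stale", repository));
    System.out.println(decide("doc3", "hashC", repository));
  }
}
```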
Set the permissions for an item
Repositories use Access Control Lists (ACLs) to identify users or groups with access to an item. An ACL lists the IDs of authorized users or groups.
To ensure users only see search results they are authorized to access, you must replicate your repository's ACLs. Include the ACL when indexing an item so Google Cloud Search can provide the correct access level.
The Content Connector SDK includes classes and methods to model the ACLs of most repositories. Analyze your repository's ACLs and create corresponding ACLs for Cloud Search during indexing. Modeling complex ACLs, such as those using inheritance, requires careful planning. For more information, see Cloud Search ACLs.
Use the Acl.Builder class to set access. The full traversal sample lets all domain users (getCustomerPrincipal()) read all items (setReaders()).
Properly modeling repository ACLs, especially those using inheritance models, requires the information in Cloud Search ACLs.
Set the metadata for an item
Create an indexable item
Use the setRequestMode() method of the RepositoryDoc.Builder class to set the indexing request to ASYNCHRONOUS or SYNCHRONOUS:
- ASYNCHRONOUS: This mode has longer indexing-to-serving latency but accommodates a larger throughput quota. Use asynchronous mode for initial indexing (backfill) of an entire repository.
- SYNCHRONOUS: This mode has shorter indexing-to-serving latency but a smaller throughput quota. Use synchronous mode for indexing repository updates and changes. The request mode defaults to SYNCHRONOUS if unspecified.
Next steps
Here are a few next steps you might take:
- (Optional) Implement the close() method to release any resources before shutdown.
- (Optional) Create an identity connector using the Content Connector SDK.
Create a graph traversal connector using a template class
The Cloud Search Indexing Queue holds IDs and optional hash values for each item in the repository. A graph traversal connector pushes item IDs to the Google Cloud Search Indexing Queue and retrieves them one at a time for indexing. Google Cloud Search maintains queues and compares queue contents to determine item status, such as whether an item has been deleted from the repository. For more information about the Cloud Search Indexing Queue, see The Google Cloud Search Indexing Queue.
During indexing, the item content is fetched from the data repository and any child item IDs are pushed to the queue. The connector recursively processes parent and child IDs until all items are handled.
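The queue-driven recursion can be modeled in memory as a breadth-first walk. The sketch below is an illustration under stated assumptions, not the SDK's implementation: GraphTraversalSketch is a hypothetical class, a local deque stands in for the Cloud Search Indexing Queue, and the parent-to-children map stands in for the repository.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of queue-driven graph traversal; not the SDK's implementation.
final class GraphTraversalSketch {
  // Visits rootId and every descendant once, returning IDs in the order they would be indexed.
  static List<String> traverse(String rootId, Map<String, List<String>> children) {
    Deque<String> queue = new ArrayDeque<>();
    Set<String> seen = new HashSet<>();
    List<String> indexed = new ArrayList<>();
    queue.add(rootId);
    seen.add(rootId);
    while (!queue.isEmpty()) {
      String id = queue.poll();   // retrieve the next ID from the queue
      indexed.add(id);            // stand-in for fetching content and indexing the item
      for (String child : children.getOrDefault(id, List.of())) {
        if (seen.add(child)) {    // push each child ID only once
          queue.add(child);
        }
      }
    }
    return indexed;
  }

  public static void main(String[] args) {
    Map<String, List<String>> children = Map.of(
        "root", List.of("a", "b"),
        "a", List.of("c"),
        "b", List.of("c"));
    System.out.println(traverse("root", children));
  }
}
```

Tracking the IDs already seen keeps an item that is reachable through several parents from being processed more than once in this model.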
Implement the connector's entry point
The entry point to a connector is the main() method. This method creates an instance of the Application class and calls its start() method to run the connector.
Before calling application.start(), use the IndexingApplication.Builder class to instantiate the ListingConnector template. The ListingConnector accepts a Repository object whose methods you implement.
Implement the Repository interface
Override init(), getIds(), getDoc(), and optionally getChanges() or
close().
Perform the graph traversal
Override getIds() to retrieve initial IDs and getDoc() to handle items and
push child IDs to the queue.
Push item IDs and hash values
Retrieve and handle each item
- Check if the ID exists in the repository. If not, delete the item.
- For existing items, set permissions and metadata, and combine them into a RepositoryDoc.
- Push child IDs to the Indexing Queue.
- Return the RepositoryDoc.
Handle deleted items
Set metadata and create the item
Place child IDs in the Indexing Queue
Create a content connector using the REST API
The following sections explain how to create a content connector using the REST API.
Determine your traversal strategy
The strategies (Full, List, and Graph) are conceptually the same as for the SDK. Implement your chosen strategy using the REST API.
Implement your traversal strategy and index items
Register your schema, then populate the index:
- (Optional) Use items.upload to upload files larger than 100 KiB.
- (Optional) Use media.upload to upload media files.
- Use items.index to index the item. Example indexing request:

  {
    "name": "datasources/<data_source_id>/items/titanic",
    "acl": {
      "readers": [
        { "gsuitePrincipal": { "gsuiteDomain": true } }
      ]
    },
    "metadata": {
      "title": "Titanic",
      "viewUrl": "http://www.imdb.com/title/tt2234155/",
      "objectType": "movie"
    },
    "structuredData": {
      "object": {
        "properties": [
          { "name": "movieTitle", "textValues": { "values": ["Titanic"] } }
        ]
      }
    },
    "content": {
      "inlineContent": "A seventeen-year-old aristocrat falls in love...",
      "contentFormat": "TEXT"
    },
    "version": "01",
    "itemType": "CONTENT_ITEM"
  }

- (Optional) Use items.get to verify indexing.
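For connectors written in Java without the SDK, an items.index call can be assembled with the JDK's built-in HTTP client. The sketch below only builds the request and does not send it. IndexRequestSketch is a hypothetical class. The endpoint path, the "item" and "mode" wrapper fields, and the bearer-token header reflect the public REST reference as understood here, so verify them against that reference before relying on them.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of building (not sending) an items.index call with the JDK HTTP client.
final class IndexRequestSketch {
  // Wraps the item JSON in the request body; the wrapper field names are assumptions to verify.
  static String body(String itemJson) {
    return "{\"item\": " + itemJson + ", \"mode\": \"SYNCHRONOUS\"}";
  }

  // itemName has the form datasources/<data_source_id>/items/<item_id>.
  // accessToken is an OAuth 2.0 bearer token obtained elsewhere.
  static HttpRequest build(String itemName, String itemJson, String accessToken) {
    return HttpRequest.newBuilder()
        .uri(URI.create("https://cloudsearch.googleapis.com/v1/indexing/" + itemName + ":index"))
        .header("Authorization", "Bearer " + accessToken)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body(itemJson)))
        .build();
  }

  public static void main(String[] args) {
    String name = "datasources/my-source/items/titanic"; // placeholder IDs
    HttpRequest request = build(name, "{\"name\": \"" + name + "\"}", "placeholder-token");
    System.out.println(request.method() + " " + request.uri());
  }
}
```

To send the request, pass it to java.net.http.HttpClient and check the response status before treating the item as indexed.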
Handle repository changes
To perform a full indexing, periodically gather and index every item from the repository. For list or graph traversal, use the Google Cloud Search Indexing Queue to track changes and index only what has changed. Use items.push to add items to the queue.