The Connector SDK and Google Cloud Search API let you create Cloud Search Indexing Queues, which are used to perform the following tasks:
- Maintain per-document state (status, hash values, and so on), which can be used to keep your index in sync with your repository.
- Maintain a list of items to be indexed as discovered during the traversal process.
- Prioritize items in queues based on item status.
- Maintain additional state information for efficient integration, such as checkpoints and change tokens.
A queue is a label assigned to an indexed item, such as "default" for the default queue or "B" for queue B.
Status & priority
A document’s priority in a queue is based on its ItemStatus code. The possible ItemStatus codes, in order of priority (handled first to handled last), are:
- ERROR: Item encountered an asynchronous error during the indexing process and needs to be re-indexed.
- MODIFIED: Item that was previously indexed and has been modified in the repository since the last indexing.
- NEW_ITEM: Item that is not indexed.
- ACCEPTED: Document that was previously indexed and has not changed in the repository since the last indexing.
When two items in a queue have the same status, higher priority is given to the items that have been in the queue for the longest period of time.
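The ordering rule can be illustrated with a short sketch. This is a toy model, not the Cloud Search implementation: the `QueueEntry` class, the `enqueued_at` timestamp, and the `poll_order` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

# Lower number = handled earlier, following the ItemStatus order listed above.
STATUS_PRIORITY = {"ERROR": 0, "MODIFIED": 1, "NEW_ITEM": 2, "ACCEPTED": 3}


@dataclass
class QueueEntry:
    item_id: str
    status: str
    enqueued_at: float  # hypothetical timestamp; smaller = in the queue longer


def poll_order(entries):
    """Sort by status priority, then by longest time in the queue."""
    return sorted(entries, key=lambda e: (STATUS_PRIORITY[e.status], e.enqueued_at))


entries = [
    QueueEntry("doc-1", "ACCEPTED", 10.0),
    QueueEntry("doc-2", "NEW_ITEM", 30.0),
    QueueEntry("doc-3", "ERROR", 50.0),
    QueueEntry("doc-4", "NEW_ITEM", 20.0),
    QueueEntry("doc-5", "MODIFIED", 40.0),
]
ordered_ids = [e.item_id for e in poll_order(entries)]
print(ordered_ids)  # ['doc-3', 'doc-5', 'doc-4', 'doc-2', 'doc-1']
```

The two NEW_ITEM entries show the tie-break: doc-4 entered the queue earlier than doc-2, so it is handled first.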
Overview of using indexing queues to index a new or changed item
Figure 1 shows the steps in indexing a new or changed item using an indexing queue. These steps show REST API calls. For equivalent SDK calls, refer to Queue operations (Connector SDK).
1. The content connector uses items.push to push items (metadata and hash) into an indexing queue to establish the item's status (MODIFIED, NEW_ITEM, DELETED). Specifically:
   - When pushing, the connector explicitly includes a push type or contentHash.
   - If the connector doesn't include the type, then Cloud Search automatically uses the contentHash to determine the item's status.
   - If the item is unknown, the item status is set to NEW_ITEM.
   - If the item exists and hash values match, the status is kept as ACCEPTED.
   - If the item exists and the hashes differ, the status becomes MODIFIED.

   For more information on how item status is established, refer to the Traversing the GitHub repositories sample code in the Cloud Search getting started tutorial.

   Usually, the push is associated with content traversal and/or change detection processes in the connector.
2. The content connector uses items.poll to poll the queue to determine items to index. Cloud Search tells the connector which items are most in need of indexing, sorted first by status code and then by time-in-queue.
3. The connector retrieves these items from the repository and builds index API requests.
4. The connector uses items.index to index the items. The item only enters the ACCEPTED state after Cloud Search successfully finishes processing the item.
A connector can also delete an item if it no longer exists in the repository, or push an item again if it's not modified or if there is a source repository error. For information on item deletions, see the next section.
Overview of using indexing queues to delete an item
The full-traversal strategy uses a two-queue process to index items and detect deletions. Figure 2 shows the steps in deleting an item using two indexing queues. Specifically, Figure 2 shows the second traversal performed using a full-traversal strategy. These steps use the REST API calls. For equivalent SDK calls, refer to Queue operations (Connector SDK).
1. On initial traversal, the content connector uses items.push to push items (metadata and hash) into an indexing queue, "queue A", as NEW_ITEM because they don't yet exist in the queue. Each item is assigned the label "A" for "queue A." The content is indexed into Cloud Search.
2. The content connector uses items.poll to poll queue A to determine items to index. Cloud Search tells the connector which items are most in need of indexing, sorted first by status code and then by time-in-queue.
3. The connector retrieves these items from the repository and builds index API requests.
4. The connector uses items.index to index the items. The item only enters the ACCEPTED state after Cloud Search successfully finishes processing the item.
5. The deleteQueueItems method is called on "queue B." However, no items have been pushed to queue B, so nothing can be deleted.
6. On the second full traversal, the content connector uses items.push to push items (metadata and hash) into queue B:
   - When pushing, the connector explicitly includes a push type or contentHash.
   - If the connector doesn't include the type, then Cloud Search automatically uses the contentHash to determine the item's status.
   - If the item is unknown, the item status is set to NEW_ITEM and the queue label is changed to "B."
   - If the item exists and hash values match, the status is kept as ACCEPTED and the queue label is changed to "B."
   - If the item exists and the hashes differ, the status becomes MODIFIED and the queue label is changed to "B."
7. The content connector uses items.poll to poll the queue to determine items to index. Cloud Search tells the connector which items are most in need of indexing, sorted first by status code and then by time-in-queue.
8. The connector retrieves these items from the repository and builds index API requests.
9. The connector uses items.index to index the items. The item only enters the ACCEPTED state after Cloud Search successfully finishes processing the item.
10. Finally, deleteQueueItems is called on queue A to delete all previously indexed Cloud Search items that still have a queue "A" label.

With subsequent full traversals, the queue used for indexing and the queue used for deleting are swapped.
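The two-queue deletion strategy can be sketched with a small in-memory model. This is only an illustration under simplifying assumptions: `ToyIndex` and `full_traversal` are hypothetical names, no REST calls are made, and polling and indexing are omitted so that only the label swap and the deletion step are shown.

```python
class ToyIndex:
    """In-memory stand-in for indexed items and their queue labels."""

    def __init__(self):
        self.queue_label = {}  # item ID -> queue label

    def push(self, item_id, queue):
        # Pushing an item assigns (or reassigns) its queue label.
        self.queue_label[item_id] = queue

    def delete_queue_items(self, queue):
        # Remove every item that still carries the given label.
        stale = sorted(i for i, q in self.queue_label.items() if q == queue)
        for item_id in stale:
            del self.queue_label[item_id]
        return stale


def full_traversal(index, repository_ids, index_queue, delete_queue):
    """Push everything found in the repository, then delete leftovers."""
    for item_id in repository_ids:
        index.push(item_id, index_queue)
    return index.delete_queue_items(delete_queue)


index = ToyIndex()
# First traversal: index into A, delete from B (nothing is there yet).
first = full_traversal(index, ["a", "b", "c"], index_queue="A", delete_queue="B")
# Second traversal: "b" is gone from the repository, so it keeps label A.
second = full_traversal(index, ["a", "c", "d"], index_queue="B", delete_queue="A")
# Third traversal: the queues are swapped again; "a" has disappeared.
third = full_traversal(index, ["c", "d"], index_queue="A", delete_queue="B")
print(first, second, third)  # [] ['b'] ['a']
```

Items that are still in the repository are relabeled on every traversal, so only items that were not pushed again keep the old label and get deleted.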
Queue operations (Connector SDK)
The Content Connector SDK provides operations for pushing items to, and pulling items from, a queue.
To package and push an item to a queue, use the pushItems builder class.
You do not need to do anything specific to pull items from a queue for processing. Instead, the SDK automatically pulls items from the queue, in priority order, using the Repository class's getDoc method.
Queue operations (REST API)
The REST API provides the following two methods for pushing items to and pulling items from a queue:
- To push an item to a queue, use Items.push.
- To poll items in the queue, use Items.poll.
You can also use Items.index to push items to the queue during indexing. Items pushed to the queue during indexing don’t require a type and are automatically assigned a status of ACCEPTED.
Items.push
The Items.push method adds IDs to the queue. This method can be called with a specific type value, which determines the result of the push operation. For a list of type values, refer to the item.type field in the Items.push method.
Pushing a new ID results in adding a new entry with a NEW_ITEM ItemStatus code.
The optional payload is always stored, treated as an opaque value, and returned from Items.poll.
When an item is polled, it is reserved, meaning it cannot be returned by another call to Items.poll. Using Items.push with type as NOT_MODIFIED, REPOSITORY_ERROR, or REQUEUE unreserves polled entries. For further information about reserved and unreserved entries, refer to the Items.poll section.
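As a rough illustration, a push request body might be assembled as follows. This is a sketch, not a reference: the JSON field names (`connectorName`, `item.name`, `item.type`, `item.queue`, `item.contentHash`, `item.payload`) and the resource name formats are assumptions to verify against the Items.push reference, and `build_push_body` and the IDs are hypothetical. No request is sent.

```python
import base64
import json

# Push types this sketch accepts; the authoritative list is the item.type
# field in the Items.push reference.
PUSH_TYPES = {"MODIFIED", "NOT_MODIFIED", "REPOSITORY_ERROR", "REQUEUE"}


def build_push_body(item_name, connector_name, queue="default",
                    push_type=None, content_hash=None, payload=None):
    """Build a JSON-serializable push body (field names are assumptions)."""
    if push_type is not None and push_type not in PUSH_TYPES:
        raise ValueError(f"unknown push type: {push_type}")
    item = {"name": item_name, "queue": queue}
    if push_type is not None:
        item["type"] = push_type
    if content_hash is not None:
        item["contentHash"] = content_hash
    if payload is not None:
        # The payload is opaque bytes; base64 keeps it JSON-safe.
        item["payload"] = base64.b64encode(payload).decode("ascii")
    return {"connectorName": connector_name, "item": item}


body = build_push_body(
    item_name="datasources/SOURCE_ID/items/doc-1",           # hypothetical ID
    connector_name="datasources/SOURCE_ID/connectors/demo",  # hypothetical ID
    push_type="REQUEUE",
    payload=b"checkpoint-42",
)
print(json.dumps(body, indent=2))
```

Because the payload is treated as an opaque value, the connector can store whatever it needs to resume work (here, a made-up checkpoint string) and decode it again when the item comes back from a poll.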
Items.push with hashes
The Google Cloud Search API supports specifying metadata and content hash values on Items.index requests. Instead of specifying type, the metadata and/or content hash values can be specified with a push request. The Cloud Search Indexing Queue compares the provided hash values with the stored values available with the item in the data source. If they don't match, that entry is marked as MODIFIED. If a corresponding item doesn't exist in the index, then the status is NEW_ITEM.
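The hash comparison can be sketched as a small decision function. This is a simplified model of the rules described in this document, not the service's code: `status_after_push` and `content_hash` are hypothetical helpers, only a single hash per item is compared, and SHA-256 is an arbitrary choice made for the example.

```python
import hashlib


def content_hash(content: bytes) -> str:
    """Hash used by this sketch; the algorithm choice is an assumption."""
    return hashlib.sha256(content).hexdigest()


def status_after_push(stored_hashes, item_id, pushed_hash):
    """Return the status a hash-only push would produce in this toy model."""
    if item_id not in stored_hashes:
        return "NEW_ITEM"   # no corresponding item exists
    if stored_hashes[item_id] == pushed_hash:
        return "ACCEPTED"   # hashes match, item is unchanged
    return "MODIFIED"       # hashes differ, item must be re-indexed


# Hashes recorded when the items were last indexed.
stored = {"doc-1": content_hash(b"v1"), "doc-2": content_hash(b"v1")}

unchanged = status_after_push(stored, "doc-1", content_hash(b"v1"))
changed = status_after_push(stored, "doc-2", content_hash(b"v2"))
unknown = status_after_push(stored, "doc-3", content_hash(b"v1"))
print(unchanged, changed, unknown)  # ACCEPTED MODIFIED NEW_ITEM
```

Pushing hashes instead of an explicit type lets the connector skip its own change tracking: it only needs to compute a stable hash for each item during traversal.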
Items.poll
The Items.poll method retrieves the highest priority entries from the queue. The requested and returned status values indicate the status(es) of the priority queue(s) requested or the status of the returned IDs.
By default, entries from any section of the queue may be returned, based on priority. Each returned entry is reserved, and is not returned by other calls to Items.poll until one of the following cases is met:
- The reservation times out.
- The entry is enqueued again by Items.index.
- Items.push is called with a type value of NOT_MODIFIED, REPOSITORY_ERROR, or REQUEUE.
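Reservation behavior can be modeled with a toy queue. This sketch makes several assumptions: `ToyQueue` is a hypothetical class, the 60-second reservation length is an invented value, the current time is passed in explicitly to keep the example deterministic, and re-enqueueing through Items.index is not modeled.

```python
# Push types that release a reservation, as listed above.
UNRESERVING_TYPES = {"NOT_MODIFIED", "REPOSITORY_ERROR", "REQUEUE"}


class ToyQueue:
    """In-memory model of poll reservations (not the real service)."""

    def __init__(self, item_ids, reservation_seconds=60.0):
        self.item_ids = list(item_ids)  # assumed to be in priority order
        self.reservation_seconds = reservation_seconds  # invented value
        self.reserved_until = {}  # item ID -> time the reservation expires

    def poll(self, now, limit=10):
        returned = []
        for item_id in self.item_ids:
            if self.reserved_until.get(item_id, 0.0) > now:
                continue  # still reserved by an earlier poll
            self.reserved_until[item_id] = now + self.reservation_seconds
            returned.append(item_id)
            if len(returned) == limit:
                break
        return returned

    def push(self, item_id, push_type):
        if push_type in UNRESERVING_TYPES:
            self.reserved_until.pop(item_id, None)  # release the reservation


queue = ToyQueue(["a", "b"])
first = queue.poll(now=0.0)    # both entries are returned and reserved
second = queue.poll(now=10.0)  # nothing is available while reserved
queue.push("a", "REQUEUE")     # explicitly unreserve "a"
third = queue.poll(now=20.0)   # only "a" is available again
fourth = queue.poll(now=70.0)  # the reservation on "b" has timed out
print(first, second, third, fourth)  # ['a', 'b'] [] ['a'] ['b']
```

Reservations keep two pollers from working on the same entry, while the timeout ensures an entry is not lost if a poller fails before reporting back.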