Warning: The Cloud Search reference connectors are provided "as is" as sample code for use in creating your own working connectors. This sample code requires substantial customization and testing before being used in proof-of-concept or production environments. For production use, we highly recommend obtaining help from one of our Cloud Search partners. For further help finding a suitable Cloud Search partner, contact your Google Account Manager. |
You can set up Google Cloud Search to discover and index data from your organization's databases by using the Google Cloud Search database connector.
Important considerations
You can install and run the Cloud Search database connector in almost any environment where Java apps can run, so long as the connector has access to both the internet and the database.
System requirements
System requirements | |
---|---|
Operating system | Windows or Linux |
SQL database | Any SQL database with a JDBC 4.0 or later compliant driver, including the following:
|
Software | JDBC driver for the connector to use to access the database (downloaded and installed separately) |
Deploy the connector
The following steps describe how to install the connector and configure it to index the specified databases and return the results to Cloud Search users.
Prerequisites
Before you deploy the Cloud Search database connector, gather the following information:
- Google Workspace private key, which also contains the service account ID. To learn how to get a private key, go to Configure access to the Google Cloud Search REST API.
- Google Workspace data source ID. To learn how to get a data source ID, go to Add a data source to search.
Step 1. Download and build the database connector software
- Clone the connector repository from GitHub.
$ git clone https://github.com/google-cloudsearch/database-connector.git $ cd database-connector
- Check out the desired version of the connector:
$ git checkout tags/v1-0.0.3
- Build the connector.
$ mvn package
To skip the tests when building the connector, usemvn package -DskipTests
. - Copy the connector zip file to your local installation directory and unzip it:
$ cp target/google-cloudsearch-database-connector-v1-0.0.3.zip installation-dir $ cd installation-dir $ unzip google-cloudsearch-database-connector-v1-0.0.3.zip $ cd google-cloudsearch-database-connector-v1-0.0.3
Step 2. Configure the database connector
- Create a text file and name it
connector-config.properties
(the default) or similar. Google recommends that you name configuration files with the.properties
or.config
extension and keep the file in the same directory as the connector. If you use a different name or path, you must specify the path when you run the connector. - Add parameters as key/value pairs to the file contents. The configuration file must specify
the parameters for data source access, database access, a database full traversal SQL statement,
a content field title, and column definitions. You can also configure other connector behavior
with optional parameters. For example:
# Required parameters for data source access api.sourceId=1234567890abcdef api.identitySourceId=0987654321lmnopq api.serviceAccountPrivateKeyFile=./PrivateKey.json # # Required parameters for database access db.url=jdbc:mysql://localhost:3306/mysql_test db.user=root db.password=passw0rd # # Required full traversal SQL statement parameter db.allRecordsSql=select customer_id, first_name, last_name, phone, change_timestamp from address_book # # Required parameters for column definitions and URL format db.allColumns=customer_id, first_name, last_name, phone, change_timestamp db.uniqueKeyColumns=customer_id url.columns=customer_id # # Required content field parameter contentTemplate.db.title=customer_id # # Optional parameters to set ACLs to "entire domain" access defaultAcl.mode=fallback defaultAcl.public=true # # Optional parameters for schedule traversals schedule.traversalIntervalSecs=36000 schedule.performTraversalOnStart=true schedule.incrementalTraversalIntervalSecs=3600
For detailed descriptions of the database-specific parameters, go to the Configuration parameters reference at the end of this article.
To learn about the parameters that are common to all Cloud Search connectors, such as metadata configuration, datetime formats, and ACL options, go to Google-supplied connector parameters.
If applicable, specify properties of the schema object in the traversal SQL query parameters. Usually you can add aliases to the SQL statement. For example, if you have a movie database and the data source schema contains a property definition named "ActorName", a SQL statement could have the form:
SELECT …, last_name AS ActorName, … FROM …
.
Step 3. Run the database connector
The following example assumes the required components are located in the local directory on a Linux system.
To run the connector from the command line, enter the following command:
java \ -cp "google-cloudsearch-database-connector-v1-0.0.3.jar:mysql-connector-java-5.1.41-bin.jar" \ com.google.enterprise.cloudsearch.database.DatabaseFullTraversalConnector \ [-Dconfig=mysql.config]
Where:
google-cloud-search-database-connector-v1-0.0.3.jar
is the database connector .jar filemysql-connector-java-5.1.41-bin.jar
is the JDBC driver being used to access the databasemysql.config
is a custom-named configuration file. To ensure the connector recognizes your configuration file, specify its path on the command line. Otherwise, the connector usesconnector-config.properties
in your local directory as the default filename.
The connector reports configuration errors as it detects them. Some errors are reported when
the connector initializes, such as when a database column is defined as part of the record content
(in db.allColumns
), but the column isn't used in the traversal SQL query of the
database (in db.allRecordsSql
). Other errors are only detected and reported when
the connector attempts to access the database for the first traversal, such as invalid SQL statement syntax.
Configuration parameters reference
Data source access parameters
Setting | Parameter |
---|---|
Data source ID | api.sourceId = source-ID
Required. The Cloud Search source ID that the Google Workspace administrator set up. |
Identity source ID | api.identitySourceId = identity-source-ID
Required to use external users and groups for ACLs. The Cloud Search identity source ID that the Google Workspace administrator set up. |
Service account | api.serviceAccountPrivateKeyFile = path-to-private-key
Required. The path to the Cloud Search service account key file that the Google Workspace administrator createed. |
Database access parameters
Setting | Parameter |
---|---|
Database URL | db.url = database-URL
Required. The
full path of the database to be accessed, such as |
Database username and password | db.user = username db.password = password
Required. A valid username and password that the connector uses to access the database. This database user must have read access to the relevant records of the database being read. |
JDBC driver | db.driverClass = oracle.jdbc.OracleDriver
Required only if the JDBC 4.0 driver is not already specified in the class path. |
Traversal SQL query parameters
The connector traverses database records with SQL SELECT queries in the configuration file. You must configure a full traversal query; queries for incremental traversals are optional.
A full traversal reads every database record configured for indexing. A full traversal is required to index new records for Cloud Search and also to re-index all existing records.
An incremental traversal reads and re-indexes only newly modified database records and recent entries to the database. Incremental traversals can be more efficient than full traversals. To use incremental traversals, your database must contain timestamp fields to indicate modified records.
The connector executes these traversals according to the schedules you define in traversal schedule parameters.
Setting | Parameter |
---|---|
Full traversal query | db.allRecordsSql = SELECT column-1[, column-2,...] FROM database-name
Required. The query run for every full traversal. Every column name that the connector will use in any capacity (content, unique ID, ACLs) must be present in this query. The connector performs some preliminary verifications at startup to detect errors and omissions. For this reason, do not use a general "SELECT * FROM …" query. |
Full traversal pagination | db.allRecordsSql.pagination = {none | offset}
Value can be:
|
Incremental traversal query | db.incrementalUpdateSql = SELECT column-1[, column-2,...] FROM database-name WHERE last_update_time > ?
Required if you schedule incremental traversals. The "?" in the query is a mandatory placeholder for a timestamp value. The connector uses the timestamp to track modifications between incremental traversal SQL queries. To track the database timestamp column for the last update time, add the
For the first incremental traversal, the connector uses the start time of the connector. After the first incremental traversal, Cloud Search stores the timestamp so that connector restarts are able to access the previous incremental traversal timestamp. |
Database time zone | db.timestamp.timezone = America/Los_Angeles
Specifies the time zone to use for database timestamps. The database timestamp used to identify new record additions or newly modified database records. The default is the local time zone where the connector is running. |
Traversal SQL query examples
- Basic full traversal query that reads every record of interest in an employee database for indexing:
db.allRecordsSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field \ FROM employee
- Specify pagination by offset, and break up a full traversal into multiple queries.
For SQL Server 2012 or Oracle 12c (standard SQL 2008 syntax):
db.allRecordsSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field \ FROM employee \ ORDER BY customer_id OFFSET ? ROWS FETCH FIRST 1000 ROWS ONLY db.allRecordsSql.pagination = offset
or, for MySQL or Google Cloud SQL:
db.allRecordsSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field \ FROM employee \ ORDER BY customer_id LIMIT 1000 OFFSET ? db.allRecordsSql.pagination = offset
- Full traversal query that applies individual ACLs with aliases:
db.allRecordsSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field, last_update_time, \ permitted_readers AS readers_users, \ denied_readers AS denied_users, \ permitted_groups AS readers_groups, \ denied_groups AS denied_groups \ FROM employee
- Basic incremental traversal query:
db.incrementalUpdateSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field, last_update_time \ FROM employee \ WHERE last_update_time > ?
- Incremental traversal query that applies individual ACLs with aliases:
db.incrementalUpdateSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field, last_update_time, \ permitted_readers AS readers_users, \ denied_readers AS denied_users, \ permitted_groups AS readers_groups, \ denied_groups AS denied_groups \ FROM employee \ WHERE last_update_time > ?
- Incremental traversal query that uses the database timestamp rather than the current time:
db.incrementalUpdateSql = SELECT customer_id, first_name, last_name, employee_id, interesting_field, \ last_update_time AS timestamp_column \ FROM employee \ WHERE last_update_time > ?
Column definition parameters
The following parameters specify the columns that you use in the traversal statements and to uniquely identify each record.
Setting | Parameter |
---|---|
All columns | db.allColumns = column-1, column-2, ...column-N
Required. Identifies all the columns that are required in a SQL query when accessing the database. The columns defined with this parameter must be explicitly referenced in the queries. Every other column definition parameter is compared against this set of columns. Example: db.allColumns = customer_id, first_name, last_name, phone, change_timestamp |
Unique key columns | db.uniqueKeyColumns = column-1[, column-2]
Required. Lists either a single database column that contains unique values or by a combination of columns whose values together define a unique ID. Cloud Search requires every searchable document to have a unique identifier within a data source. You must be able to define a unique ID for each database record from column values. If you run multiple connectors on separate databases but index into a common dataset, make sure that you specify a unique ID across all documents. Examples: db.uniqueKeyColumns = customer_id # or db.uniqueKeyColumns = last_name, first_name |
URL link column | url.columns = column-1[, column-2]
Required. Specifies one or more valid, defined names of the columns used for the URL used for a clickable search result. For databases that have no relevant URL associated with each database record, a static link can be used for every record. However, if the column values do define a valid link for each record, the view URL columns and format configuration values should be specified. |
URL format | url.format = https://www.example.com/{0}
Defines the format of the view URL. Numbered parameters refer to the columns specified in db.columns, in order, starting with zero. If not specified, the default is "{0}." Examples follow this table. |
Percent-encoded columns for URL | url.columnsToEscape = column-1[, column-2]
Specifies columns from db.columns whose values will be percent-encoded before including them in the formatted URL string. |
URL column examples
To specify the columns used in traversal queries and the format of the view URL:
- To use a static URL not using any database record values:
url.format = https://www.example.com
- To use a single column value that is the view URL:
url.format = {0} url.columns = customer_id
- To use a single column value that is substituted into the view URL at position {0}:
url.format = https://www.example.com/customer/id={0} url.columns = customer_id url.columnsToEscape = customer_id
- To use multiple column values to build the view URL (columns are order-dependent):
url.format = {1}/customer={0} url.columns = customer_id, linked_url url.columnsToEscape = customer_id
Content fields
Use the content options to define which record values should be made part of the searchable content.
Setting | Parameter |
---|---|
Highest-quality search column | contentTemplate.db.title = column-name
Required. The highest-quality column for search indexing and result prioritization. |
Column prioritization for search | contentTemplate.db.quality.high = column-1[, column-2...] contentTemplate.db.quality.medium = column-1[, column-2...] contentTemplate.db.quality.low = column-1[, column-2...]
Designate content columns (except the column set for |
Content data columns | db.contentColumns = column-1[, column-2...]
Specify content columns in the database. These are formatted and uploaded to Cloud Search as searchable document content. If you don't specify a value, the default is "*" indicating that all columns should be used for content. |
Blob column | db.blobColumn = column-name
Specify the name of a single blob column to use for document content instead of a combination of content columns. If a blob column is specified, then it is considered an error if content columns are also defined. However, metadata and structured data column definitions are still allowed along with blob columns. |