Deploy a CSV Connector

This guide is intended for Google Cloud Search CSV (comma-separated values) connector administrators, that is, anyone who is responsible for downloading, configuring, running, and monitoring the connector.

This guide includes instructions for performing key tasks related to CSV connector deployment:

Download the Google Cloud Search CSV connector software
Configure the connector for use with a specific CSV data source
Deploy and run the connector

To understand the concepts in this document, you should be familiar with the fundamentals of Google Workspace, CSV files, and Access Control Lists (ACLs).

Overview of the Google Cloud Search CSV connector

The Cloud Search CSV connector works with any comma-separated values (CSV) text file. A CSV file stores tabular data, and each line of the file is a data record.

Google Cloud Search's CSV Connector extracts individual rows from a CSV file and indexes them into Cloud Search via Cloud Search's Indexing API. Once successfully indexed, individual rows from CSV files are searchable through Cloud Search's clients or Cloud Search's Query API. The CSV connector also supports controlling users' access to content in the search results, by using ACLs.

The Google Cloud Search CSV connector can be installed on Linux or Windows. Before you deploy the connector, ensure that you have the following required components:

Java JRE 1.8 installed on the computer that runs the Google Cloud Search CSV connector
Google Workspace information required to establish relationships between Google Cloud Search and the data source:
- Google Workspace private key (which contains the service account ID)
- Google Workspace data source ID
Typically, the Google Workspace administrator for the domain can supply these credentials for you.

Deployment steps

To deploy the Google Cloud Search CSV connector, follow these steps:

Install the Google Cloud Search CSV connector software
Specify the CSV connector configuration
Configure access to the Google Cloud Search data source
Configure CSV file access
Specify column names to index, unique key columns, and datetime columns
Specify columns to use in clickable search result URLs
Specify metadata information, column formats
Schedule data traversal
Specify Access Control List (ACL) options

1. Install the SDK

Install the SDK into your local Maven repository.

Clone the SDK repository from GitHub:
```
$ git clone https://github.com/google-cloudsearch/connector-sdk.git
$ cd connector-sdk/csv
```

Check out the desired version of the SDK:
```
$ git checkout tags/v1-0.0.3
```
Build the connector:
```
$ mvn package
```

Copy the connector zip file to your local installation directory:
```
$ cp target/google-cloudsearch-csv-connector-v1-0.0.3.zip installation-dir
$ cd installation-dir
$ unzip google-cloudsearch-csv-connector-v1-0.0.3.zip
$ cd google-cloudsearch-csv-connector-v1-0.0.3
```

2. Specify the CSV connector configuration

As the connector administrator, you control the CSV connector's behavior and attributes by defining parameters in the connector's configuration file. Configurable parameters include:

Access to a data source
Location of the CSV file
CSV column definitions
Column(s) that define a unique id
Traversal options
ACL options to restrict data access

For the connector to properly access a CSV file and index the relevant content, you must first create its configuration file.

To create a configuration file:

Open a text editor of your choice.
Add key=value pairs to the file contents as described in the following sections.
Save and name the configuration file.
Google recommends that you name the configuration file connector-config.properties so that no additional command-line parameters are required to run the connector.

Because you can specify the configuration file path on the command line, a standard file location is not necessary. However, keep the configuration file in the same directory as the connector to simplify tracking and running the connector.

To ensure the connector recognizes your configuration file, specify its path on the command line. If you do not specify a path, the connector looks for a file named connector-config.properties in your local directory. For information about specifying the configuration path on the command line, see Run the Cloud Search CSV connector.

3. Configure access to the Google Cloud Search data source

The first parameters every configuration file must specify are those needed to access the Cloud Search data source, as shown in the following table. Typically, you need the data source ID, the service account ID, and the path to the service account's private key file to configure the connector's access to Cloud Search. The steps required to set up a data source are described in Manage third-party data sources.

Setting	Parameter
Data source ID	`api.sourceId=1234567890abcdef` Required. The Google Cloud Search source ID set up by the Google Workspace administrator, as described in Manage third-party data sources.
Path to the service account private key file	`api.serviceAccountPrivateKeyFile=./PrivateKey.json` Required. The Google Cloud Search service account key file that the CSV connector uses to access Google Cloud Search.
Identity source ID	`api.identitySourceId=x0987654321` Required if using external users and groups. The Google Cloud Search identity source ID set up by the Google Workspace administrator.
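
As a rough illustration of the table above, the following Python sketch reads key=value pairs and reports which data source access parameters are missing. It is not the connector's own code (the connector is a Java application), it handles only a simplified subset of the properties file syntax, and the function names `parse_properties` and `missing_access_keys` are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines; blank lines and '#' comments are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_access_keys(props, uses_external_identities=False):
    """Return the data source access keys that are absent or empty."""
    required = ["api.sourceId", "api.serviceAccountPrivateKeyFile"]
    if uses_external_identities:
        # Only required when external users and groups are used.
        required.append("api.identitySourceId")
    return [key for key in required if not props.get(key)]


example = """
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
"""
props = parse_properties(example)
print(missing_access_keys(props))
print(missing_access_keys(props, uses_external_identities=True))
```

A check like this can be run before starting the connector to catch an incomplete configuration file early.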

4. Configure CSV file parameters

Before the connector can traverse a CSV file and extract data from it for indexing, you must identify the path to the file. You can also specify the file format and type of file encoding. Add the following parameters to specify the CSV file properties in the configuration file.

Setting	Parameter
Path to the CSV file	`csv.filePath=./movie_content.csv` Required. The path to the CSV file from which the connector extracts content for indexing.
File format	`csv.format=DEFAULT` The format of the file. Possible values are from the Apache Commons CSV CSVFormat class. Format values include: `DEFAULT`, `EXCEL`, `INFORMIX_UNLOAD`, `INFORMIX_UNLOAD_CSV`, `MYSQL`, `RFC4180`, `ORACLE`, `POSTGRESQL_CSV`, `POSTGRESQL_TEXT`, and `TDF`. If unspecified, Cloud Search uses `DEFAULT`.
File format modifier	`csv.format.withMethod=value` A modification to how Cloud Search handles the file. Possible methods are from the Apache Commons CSV CSVFormat class and include those that take a single character, string, or boolean value. For example, to specify a semicolon as a delimiter, use `csv.format.withDelimiter=;`. To ignore empty lines, use `csv.format.withIgnoreEmptyLines=true`.
File encoding type	`csv.fileEncoding=UTF-8` The Java character set to use when Cloud Search reads the file. If unspecified, Cloud Search uses the platform default character set.
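
To make the effect of a delimiter modifier and of ignoring empty lines more concrete, here is a minimal Python sketch. It uses Python's built-in `csv` module as a stand-in for Apache Commons CSV, so details such as quoting rules may differ from what the connector does; the function name `read_rows` is hypothetical.

```python
import csv
import io


def read_rows(stream, delimiter=",", ignore_empty_lines=True):
    """Yield each CSV record as a list of strings."""
    for row in csv.reader(stream, delimiter=delimiter):
        if ignore_empty_lines and not row:
            continue  # csv.reader returns [] for a blank line
        yield row


# With a real file, the stream would come from something like
# open(path, encoding="utf-8", newline=""), which mirrors csv.fileEncoding.
sample = io.StringIO("tt0031381;Gone with the Wind\n\ntt0034583;Casablanca\n")
rows = list(read_rows(sample, delimiter=";"))
print(rows)
```

In this sketch, the semicolon delimiter corresponds to `csv.format.withDelimiter=;` and the blank-line filter corresponds to `csv.format.withIgnoreEmptyLines=true`.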

5. Specify column names to index and unique key columns

For the connector to access and index CSV files, you must provide information about column definitions in the configuration file. If the configuration file does not contain the parameters that specify the column names to index and unique key columns, default values are used.

Setting	Parameter
Columns to index	`csv.csvColumns=movieId,movieTitle,description,actors,releaseDate,year,userratings...` The column names to be indexed from the CSV file. If `csv.csvColumns` is not set, then the first row of the CSV file is used as the header. If `csv.csvColumns` is set, then it takes precedence over the first row of the CSV. If you have set `csv.csvColumns` and the first row of the CSV file is a list of column names, then you need to set `csv.skipHeaderRecord=true` to avoid trying to index the first row as data. Default values are the columns in the header row in the file.
Unique key columns	`csv.uniqueKeyColumns=movieId` The CSV column(s) whose values are used to generate each record's unique ID. If not specified, the hash of the CSV record is used as its unique key. Default value is the record's hashcode.
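
The interaction between `csv.csvColumns`, `csv.skipHeaderRecord`, and `csv.uniqueKeyColumns` can be sketched in Python as follows. This is an illustration of the rules described in the table, not the connector's implementation: the function names are hypothetical, the `||` separator is an assumption, and SHA-256 stands in for the record hashcode that the connector uses.

```python
import hashlib


def resolve_columns(rows, csv_columns=None, skip_header_record=False):
    """Return (column_names, data_rows) following the rules in the table."""
    rows = list(rows)
    if csv_columns:
        # csv.csvColumns takes precedence; the first row is dropped only on request.
        data = rows[1:] if skip_header_record else rows
        return list(csv_columns), data
    # Without csv.csvColumns, the first row is treated as the header.
    return rows[0], rows[1:]


def unique_id(record, unique_key_columns=None):
    """Build an ID from the key columns, or hash the whole record."""
    if unique_key_columns:
        # Joining with "||" is an assumption made for this sketch.
        return "||".join(record[col] for col in unique_key_columns)
    joined = "\x1f".join(f"{k}={v}" for k, v in record.items())
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


rows = [["id", "title"], ["tt0031381", "Gone with the Wind"]]
columns, data = resolve_columns(rows, ["movieId", "movieTitle"], skip_header_record=True)
records = [dict(zip(columns, row)) for row in data]
print(unique_id(records[0], ["movieId"]))
```

If `skip_header_record` were left as false here, the row `id,title` would be treated as a data record, which is the situation the table warns about.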

6. Specify columns to use in clickable search result URLs

When a user searches using Google Cloud Search, it responds by showing a results page that includes a clickable URL for each result. To enable this feature, you must add the parameters shown in the following table to the configuration file.

Setting	Parameter
Search result URL format	`url.format=https://mymoviesite.com/movies/{0}` Required. The format used to construct the view URL for CSV content.
Search result URL parameters	`url.columns=movieId` Required. The CSV column names whose values are used to generate the record's view URL.
Search result URL parameters to escape	`url.columnsToEscape=movieId` Optional. The CSV column names whose values are URL-escaped to generate a valid view URL.
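
The following Python sketch shows one way the three URL parameters could combine into a view URL. It assumes that `{0}`, `{1}`, and so on are replaced by the values of the columns listed in `url.columns`, in order; the connector's exact substitution and escaping rules may differ, and `build_view_url` is a hypothetical name.

```python
from urllib.parse import quote


def build_view_url(url_format, url_columns, record, columns_to_escape=()):
    """Fill {0}, {1}, ... in url_format with the values of url_columns."""
    values = []
    for column in url_columns:
        value = record[column]
        if column in columns_to_escape:
            # Percent-encode every reserved character in the value.
            value = quote(value, safe="")
        values.append(value)
    return url_format.format(*values)


record = {"movieId": "tt 0031381", "movieTitle": "Gone with the Wind"}
url = build_view_url(
    "https://mymoviesite.com/movies/{0}",
    ["movieId"],
    record,
    columns_to_escape={"movieId"},
)
print(url)
```

Escaping matters when a column value contains characters such as spaces or slashes that would otherwise produce an invalid or misleading URL.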

7. Specify metadata information, column formats, search quality

You can add parameters to the configuration file that specify:

Metadata Configuration Parameters
Datetime formats
Column formats
Search quality

Metadata Configuration Parameters

Metadata Configuration Parameters describe the CSV columns used for populating item metadata. If the configuration file does not contain these parameters, default values are used. The following table shows these parameters.

Setting	Parameter
Title	`itemMetadata.title.field=movieTitle` `itemMetadata.title.defaultValue=Gone with the Wind` The metadata attribute that contains the value corresponding to the document title. The default value is an empty string.
URL	`itemMetadata.sourceRepositoryUrl.field=url` `itemMetadata.sourceRepositoryUrl.defaultValue=https://www.imdb.com/title/tt0031381/` The metadata attribute that contains the value for the document URL for search results.
Created timestamp	`itemMetadata.createTime.field=releaseDate` `itemMetadata.createTime.defaultValue=1940-01-17` The metadata attribute that contains the value for the document creation timestamp.
Last modified time	`itemMetadata.updateTime.field=releaseDate` `itemMetadata.updateTime.defaultValue=1940-01-17` The metadata attribute that contains the value for the last modification timestamp for the document.
Document language	`itemMetadata.contentLanguage.field=languageCode` `itemMetadata.contentLanguage.defaultValue=en-US` The content language for documents being indexed.
Schema object type	`itemMetadata.objectType.field=type` `itemMetadata.objectType.defaultValue=movie` The object type used by the connector, as defined in the schema. The connector won't index any structured data if this property is not specified.

Datetime formats

Datetime formats specify the formats expected in metadata attributes. If the configuration file does not contain this parameter, default values are used. The following table shows this parameter.

Setting	Parameter
Additional datetime formats	`structuredData.dateTimePatterns=MM/dd/uuuu HH:mm:ssXXX` A semicolon-separated list of additional java.time.format.DateTimeFormatter patterns. The patterns are used when parsing string values for any date or date-time fields in the metadata or schema. The default value is an empty list, but RFC 3339 and RFC 1123 formats are always supported.
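
To illustrate how a list of additional patterns might be tried in turn, here is a small Python sketch. Python's `strptime` format codes stand in for Java `DateTimeFormatter` patterns, so `%m/%d/%Y %H:%M:%S%z` is only an approximate equivalent of `MM/dd/uuuu HH:mm:ssXXX`. The order in which formats are tried is an assumption, RFC 1123 parsing is omitted, and `parse_datetime` is a hypothetical name.

```python
from datetime import datetime, timedelta

# Approximate Python equivalent of the Java pattern "MM/dd/uuuu HH:mm:ssXXX".
EXTRA_PATTERNS = ["%m/%d/%Y %H:%M:%S%z"]


def parse_datetime(value, extra_patterns=EXTRA_PATTERNS):
    """Try ISO 8601 / RFC 3339 style input first, then each extra pattern."""
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        pass
    for pattern in extra_patterns:
        try:
            return datetime.strptime(value, pattern)
        except ValueError:
            continue
    raise ValueError(f"no configured pattern matches {value!r}")


parsed = parse_datetime("01/17/1940 08:30:00-05:00")
print(parsed.isoformat())
```

A value that matches none of the formats raises an error in this sketch, which is a reminder to list every format that actually occurs in the CSV file.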

Column formats

Column formats specify information about the column(s) that should be a part of the searchable content. If the configuration file does not contain these parameters, default values are used. The following table shows these parameters.

Setting	Parameter
Skip header	`csv.skipHeaderRecord=true` Boolean. Ignore the header record (first line) in the CSV file. If you have set `csv.csvColumns` and the CSV file has a header row, then you must set `csv.skipHeaderRecord=true`. This prevents indexing the first row in the file as data. If the CSV file does not have a header row, set `csv.skipHeaderRecord=false`. The default value is false.
Multi-value columns	`csv.multiValueColumns=genre,actors` The column names in the CSV file that have multiple values. The default value is an empty string.
Delimiter for multi-value columns	`csv.multiValue.genre=;` The delimiter for the multi-value columns. The default delimiter is a comma.
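
As a sketch of how the multi-value parameters could be applied to a record, consider the following Python example. The comma default comes from the table above; trimming whitespace and dropping empty parts are assumptions of this sketch, and `split_multi_values` is a hypothetical name.

```python
def split_multi_values(record, multi_value_columns, delimiters=None):
    """Return a copy of record with multi-value columns split into lists."""
    delimiters = delimiters or {}
    result = dict(record)
    for column in multi_value_columns:
        delimiter = delimiters.get(column, ",")  # comma is the documented default
        raw = record.get(column, "")
        # Trimming whitespace and dropping empty parts are assumptions.
        result[column] = [part.strip() for part in raw.split(delimiter) if part.strip()]
    return result


record = {
    "movieTitle": "Gone with the Wind",
    "genre": "Drama;Romance",
    "actors": "Clark Gable, Vivien Leigh",
}
# Mirrors csv.multiValueColumns=genre,actors and csv.multiValue.genre=;
split = split_multi_values(record, ["genre", "actors"], {"genre": ";"})
print(split["genre"], split["actors"])
```

Columns that are not listed as multi-value, such as `movieTitle` here, are left as single strings.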

Search quality

The Cloud Search CSV connector allows automatic HTML formatting for data fields. Your connector defines the data fields at the beginning of connector execution, and then uses a content template to format each data record before uploading it to Cloud Search.

The content template defines the importance of each field value for searching. The title field is required and is defined as the highest priority. You can designate search quality importance levels for all the other content fields: high, medium or low. Any content field not defined in a specific category defaults to low priority. The following table shows these parameters.

Setting	Parameter
Content title	`contentTemplate.csv.title=movieTitle` The content title is the highest search quality field.
High search quality for content fields	`contentTemplate.csv.quality.high=actors` Content fields given a high search quality value. The default is an empty string.
Low search quality for content fields	`contentTemplate.csv.quality.low=genre` Content fields given a low search quality value. The default is an empty string.
Medium search quality for content fields	`contentTemplate.csv.quality.medium=description` Content fields given a medium search quality value. The default is an empty string.
Unspecified content fields	`contentTemplate.csv.unmappedColumnsMode=IGNORE` How the connector handles unspecified content fields. Valid values are `APPEND` (append unspecified content fields to the template) and `IGNORE` (ignore unspecified content fields). The default value is `APPEND`.
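
The ordering behavior of the content template can be approximated with the Python sketch below. The HTML markup it produces is purely an assumption for illustration and does not reflect the template that the connector actually generates; `render_content` is a hypothetical name.

```python
from html import escape


def render_content(record, title, high=(), medium=(), low=(), unmapped_mode="APPEND"):
    """Render one record as HTML with fields ordered by search quality."""
    ordered = [title, *high, *medium, *low]
    if unmapped_mode == "APPEND":
        # Unspecified fields are appended after the explicitly ranked ones.
        ordered += [column for column in record if column not in ordered]
    lines = [f"<h1>{escape(record[title])}</h1>"]
    for column in ordered[1:]:
        lines.append(f"<p>{escape(column)}: {escape(record.get(column, ''))}</p>")
    return "\n".join(lines)


record = {
    "movieTitle": "Gone with the Wind",
    "actors": "Clark Gable",
    "description": "A <classic> epic",
    "year": "1939",
}
print(render_content(record, "movieTitle", high=["actors"],
                     medium=["description"], unmapped_mode="IGNORE"))
```

With `IGNORE`, the `year` column is left out of the rendered content because it is not assigned to any quality level; with `APPEND`, it would be added at the end.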

8. Schedule data traversal

Traversal is the connector's process for discovering content from the data source, in this case, a CSV file. As the CSV connector runs, it will traverse the rows of a CSV file, and index each row to Cloud Search via the Indexing API.

Full traversal indexes all rows in the file. Incremental traversal only indexes rows that are added or modified since the previous traversal. The CSV connector only performs full traversals. It does not perform incremental traversals.

The scheduling parameters determine how long the connector waits between traversals. If the configuration file does not contain scheduling parameters, default values are used. The following table shows these parameters.

Setting	Parameter
Full traversal after an interval	`schedule.traversalIntervalSecs=7200` The connector performs a full traversal after a specified interval. Specify the interval between traversals in seconds. The default value is 86400 (number of seconds in one day).
Full traversal at connector startup	`schedule.performTraversalOnStart=false` Whether the connector performs a full traversal at connector startup, rather than waiting for the first interval to expire. The default value is true.
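
The combined effect of the two scheduling parameters can be pictured with a short Python sketch. It assumes that the interval is measured from the start of one traversal to the start of the next and ignores how long each traversal takes, which may not match the connector's actual scheduler; `traversal_offsets` is a hypothetical name.

```python
def traversal_offsets(count, interval_secs=86400, perform_on_start=True):
    """Offsets in seconds from startup at which the first `count` traversals begin."""
    # With performTraversalOnStart=false, the first traversal waits one interval.
    first = 0 if perform_on_start else interval_secs
    return [first + i * interval_secs for i in range(count)]


# Mirrors schedule.traversalIntervalSecs=7200 and schedule.performTraversalOnStart=false
print(traversal_offsets(3, interval_secs=7200, perform_on_start=False))
```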

9. Specify Access Control List (ACL) options

The Google Cloud Search CSV connector supports permissions through ACLs to control access to the content of the CSV file in search results. Multiple ACL options are available to let you restrict user access to indexed records.

If your repository has individual ACL information associated with each document, upload all ACL information to control document access within Cloud Search. If your repository provides partial or no ACL information, you can supply default ACL information in the following parameters, which the SDK provides to the connector.

The connector relies on default ACLs being enabled in the configuration file. To enable default ACLs, set `defaultAcl.mode` to any mode other than `none` and configure it with the `defaultAcl.*` parameters.

Setting	Parameter
ACL mode	`defaultAcl.mode=fallback` Required. The CSV connector relies on default ACL functionality and supports only `fallback` mode.
Default ACL name	`defaultAcl.name=VIRTUAL_CONTAINER_FOR_CONNECTOR_1` Optional. Overrides the name of the virtual container that the connector uses to set up default ACLs. The default value is `DEFAULT_ACL_VIRTUAL_CONTAINER`. Override this value if multiple connectors index content in the same data source.
Default public ACL	`defaultAcl.public=true` Sets the default ACL used for the entire repository to public domain access. The default value is false.
Common ACL group readers	`defaultAcl.readers.groups=google:group1, group2`
Common ACL readers	`defaultAcl.readers.users=user1, user2, google:user3`
Common ACL denied group readers	`defaultAcl.denied.groups=group3`
Common ACL denied readers	`defaultAcl.denied.users=user4, user5`
Entire domain access	To make every indexed record publicly accessible to every user in the domain, set both of the following options: `defaultAcl.mode=fallback` `defaultAcl.public=true`
Common defined ACL	To specify one ACL for each record of the data repository, set all of the following parameter values: `defaultAcl.mode=fallback` `defaultAcl.public=false` `defaultAcl.readers.groups=google:group1, group2` `defaultAcl.readers.users=user1, user2, google:user3` `defaultAcl.denied.groups=group3` `defaultAcl.denied.users=user4, user5` Every specified user and group is assumed to be a local domain-defined user or group unless prefixed with `google:` (a literal constant). The default user or group is an empty string. Supply user and group options only if `defaultAcl.public` is set to false. To list multiple groups and users, use a comma-delimited list. If `defaultAcl.mode` is set to none, records are unsearchable without defined individual ACLs.
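
The rules for reading the reader and denied lists can be sketched in Python as follows. This only illustrates how the comma-delimited values and the `google:` prefix described above might be interpreted; it is not the SDK's ACL code, and the names `parse_principals` and `default_acl` as well as the `local` and `google` labels are hypothetical.

```python
def parse_principals(value):
    """Split a comma-delimited ACL list into (namespace, name) pairs."""
    principals = []
    for item in value.split(","):
        name = item.strip()
        if not name:
            continue
        if name.startswith("google:"):
            principals.append(("google", name[len("google:"):]))
        else:
            # No prefix: treated as a local domain-defined user or group.
            principals.append(("local", name))
    return principals


def default_acl(props):
    """Summarize the default ACL described by defaultAcl.* settings."""
    if props.get("defaultAcl.public", "false").strip().lower() == "true":
        return {"public": True}
    acl = {"public": False}
    for key in ("readers.users", "readers.groups", "denied.users", "denied.groups"):
        acl[key] = parse_principals(props.get("defaultAcl." + key, ""))
    return acl


props = {
    "defaultAcl.mode": "fallback",
    "defaultAcl.public": "false",
    "defaultAcl.readers.users": "user1, user2, google:user3",
    "defaultAcl.denied.groups": "group3",
}
print(default_acl(props))
```

In this sketch, user and group lists are read only when `defaultAcl.public` is false, which matches the guidance in the table.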

Schema Definition

Cloud Search allows indexing and serving of structured and unstructured content. To support structured data queries on your data, you need to set up a schema for your data source.

Once the schema is defined, the CSV connector can refer to it to build indexing requests. As an illustrative example, consider a CSV file containing information about movies.

Assume that the input CSV file has the following columns:

movieId
movieTitle
description
year
releaseDate
actors (multiple values separated by commas)
genre (multiple values)
ratings

Based on this data structure, you can define a schema for the data source under which you want to index data from the CSV file.

```
{
  "objectDefinitions": [
    {
      "name": "movie",
      "propertyDefinitions": [
        {
          "name": "actors",
          "isReturnable": true,
          "isRepeatable": true,
          "isFacetable": true,
          "textPropertyOptions": {
            "operatorOptions": {
              "operatorName": "actor"
            }
          }
        },
        {
          "name": "releaseDate",
          "isReturnable": true,
          "isRepeatable": false,
          "isFacetable": false,
          "datePropertyOptions": {
            "operatorOptions": {
              "operatorName": "released",
              "lessThanOperatorName": "releasedbefore",
              "greaterThanOperatorName": "releasedafter"
            }
          }
        },
        {
          "name": "movieTitle",
          "isReturnable": true,
          "isRepeatable": false,
          "isFacetable": false,
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGHEST"
            },
            "operatorOptions": {
              "operatorName": "title"
            }
          }
        },
        {
          "name": "genre",
          "isReturnable": true,
          "isRepeatable": true,
          "isFacetable": true,
          "enumPropertyOptions": {
            "operatorOptions": {
              "operatorName": "genre"
            },
            "possibleValues": [
              {
                "stringValue": "Action"
              },
              {
                "stringValue": "Documentary"
              },
              {
                "stringValue": "Drama"
              },
              {
                "stringValue": "Crime"
              },
              {
                "stringValue": "Sci-fi"
              }
            ]
          }
        },
        {
          "name": "userRating",
          "isReturnable": true,
          "isRepeatable": false,
          "isFacetable": true,
          "integerPropertyOptions": {
            "orderedRanking": "ASCENDING",
            "maximumValue": "10",
            "operatorOptions": {
              "operatorName": "score",
              "lessThanOperatorName": "scorebelow",
              "greaterThanOperatorName": "scoreabove"
            }
          }
        }
      ]
    }
  ]
}
```
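
Before indexing, it can be useful to compare the schema's property names with the CSV column names. The Python sketch below does this for an abbreviated copy of the schema above. It assumes that structured data properties are filled from CSV columns with the same name, which should be confirmed against the connector's behavior; `schema_properties` and `compare_columns` are hypothetical names.

```python
import json

# Abbreviated copy of the example schema: only names and repeatability.
SCHEMA_JSON = """
{"objectDefinitions": [{"name": "movie", "propertyDefinitions": [
  {"name": "actors", "isRepeatable": true},
  {"name": "releaseDate", "isRepeatable": false},
  {"name": "movieTitle", "isRepeatable": false},
  {"name": "genre", "isRepeatable": true},
  {"name": "userRating", "isRepeatable": false}]}]}
"""


def schema_properties(schema, object_type):
    """Map each property name of object_type to its isRepeatable flag."""
    for obj in schema["objectDefinitions"]:
        if obj["name"] == object_type:
            return {p["name"]: p.get("isRepeatable", False)
                    for p in obj["propertyDefinitions"]}
    raise KeyError(object_type)


def compare_columns(csv_columns, properties):
    """Report names that appear on only one side."""
    columns = set(csv_columns)
    return {
        "properties_without_column": sorted(set(properties) - columns),
        "columns_without_property": sorted(columns - set(properties)),
    }


properties = schema_properties(json.loads(SCHEMA_JSON), "movie")
csv_columns = ["movieId", "movieTitle", "description", "year",
               "releaseDate", "actors", "genre", "ratings"]
print(compare_columns(csv_columns, properties))
```

With the example column list, the check reports `userRating` as a property that has no same-named column, and `ratings` as a column that has no same-named property.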

Example configuration file

The following example configuration file shows the parameter key=value pairs that define an example connector's behavior.

```
# data source access
api.sourceId=1234567890abcd
api.serviceAccountPrivateKeyFile=./PrivateKey.json

# CSV data structure
csv.filePath=./movie_content.csv
csv.csvColumns=movieId,movieTitle,description,releaseYear,genre,actors,ratings,releaseDate
csv.skipHeaderRecord=true
url.format=https://mymoviesite.com/movies/{0}
url.columns=movieId
csv.datetimeFormat.releaseDate=yyyy-MM-dd
csv.multiValueColumns=genre,actors
csv.multiValue.genre=;
contentTemplate.csv.title=movieTitle

# metadata structured data and content
itemMetadata.title.field=movieTitle
itemMetadata.createTime.field=releaseDate
itemMetadata.contentLanguage.defaultValue=en-US
itemMetadata.objectType.defaultValue=movie
contentTemplate.csv.quality.medium=description
contentTemplate.csv.unmappedColumnsMode=IGNORE

# ACLs
defaultAcl.mode=fallback
defaultAcl.public=true
```

For detailed descriptions of each parameter, see the Configuration parameters reference.

Run the Cloud Search CSV connector

To run the connector from the command line, type the following command:

```
$ java -jar google-cloudsearch-csv-connector-v1-0.0.3.jar -Dconfig=my.config
```

By default, connector logs are available on standard output. You can log to files by specifying logging.properties.