Deploy a CSV Connector

This guide is intended for Google Cloud Search CSV (comma-separated values) connector administrators, that is, anyone who is responsible for downloading, configuring, running, and monitoring the connector.

This guide includes instructions for performing key tasks related to CSV connector deployment:

  • Download the Google Cloud Search CSV connector software
  • Configure the connector for use with a specific CSV data source
  • Deploy and run the connector

To understand the concepts in this document, you should be familiar with the fundamentals of Google Workspace, CSV files, and Access Control Lists (ACLs).

Overview of the Google Cloud Search CSV connector

The Cloud Search CSV connector works with any comma-separated values (CSV) text file. A CSV file stores tabular data, and each line of the file is a data record.

Google Cloud Search's CSV Connector extracts individual rows from a CSV file and indexes them into Cloud Search via Cloud Search's Indexing API. Once successfully indexed, individual rows from CSV files are searchable through Cloud Search's clients or Cloud Search's Query API. The CSV connector also supports controlling users' access to content in the search results, by using ACLs.

The Google Cloud Search CSV connector can be installed on Linux or Windows. Before you deploy the connector, ensure that you have the following required components:

  • Java JRE 1.8 installed on the computer that runs the Google Cloud Search CSV connector
  • Google Workspace information required to establish relationships between Google Cloud Search and the data source:

    Typically, the Google Workspace administrator for the domain can supply these credentials for you.

Deployment steps

To deploy the Google Cloud Search CSV connector, follow these steps:

  1. Install the Google Cloud Search CSV connector software
  2. Specify the CSV connector configuration
  3. Configure access to the Google Cloud Search data source
  4. Configure CSV file access
  5. Specify column names to index, unique key columns, and datetime columns
  6. Specify columns to use in clickable search result URLs
  7. Specify metadata information, column formats, and search quality
  8. Schedule data traversal
  9. Specify Access Control List (ACL) options

1. Install the SDK

Install the SDK into your local Maven repository.

  1. Clone the SDK repository from GitHub:

    $ git clone https://github.com/google-cloudsearch/connector-sdk.git
    $ cd connector-sdk/csv

  2. Check out the desired version of the SDK:

    $ git checkout tags/v1-0.0.3

  3. Build the connector:

    $ mvn package

  4. Copy the connector zip file to your local installation directory:

    $ cp target/google-cloudsearch-csv-connector-v1-0.0.3.zip installation-dir
    $ cd installation-dir
    $ unzip google-cloudsearch-csv-connector-v1-0.0.3.zip
    $ cd google-cloudsearch-csv-connector-v1-0.0.3

2. Specify the CSV connector configuration

As the connector administrator, you control the CSV connector's behavior and attributes by defining parameters in the connector's configuration file. Configurable parameters include:

  • Access to a data source
  • Location of the CSV file
  • CSV column definitions
  • Column(s) that define a unique id
  • Traversal options
  • ACL options to restrict data access

For the connector to properly access a CSV file and index the relevant content, you must first create its configuration file.

To create a configuration file:

  1. Open a text editor of your choice and create the configuration file.
    Add key=value pairs to the file contents as described in the following sections.
  2. Save and name the configuration file.
    Google recommends that you name the configuration file connector-config.properties so that no additional command-line parameters are required to run the connector.

Because you can specify the configuration file path on the command line, a standard file location is not necessary. However, keep the configuration file in the same directory as the connector to simplify tracking and running the connector.

To ensure the connector recognizes your configuration file, specify its path on the command line. Otherwise, the connector looks for a file with the default name, connector-config.properties, in your local directory. For information about specifying the configuration path on the command line, see Run the Cloud Search CSV connector.
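To illustrate the key=value layout, the following Python sketch reads such a file into a dictionary. It is a simplified assumption, not the connector's own parser: it skips blank lines and # comments and splits on the first = only, and it does not reproduce every rule of the Java .properties format. The helper name load_properties is hypothetical.

```python
def load_properties(text):
    """Parse simple key=value lines into a dict (simplified, not full Java .properties)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")  # split on the first '=' only
        if sep:
            props[key.strip()] = value.strip()
    return props


SAMPLE = """\
# data source access
api.sourceId=1234567890abcdef
api.serviceAccountPrivateKeyFile=./PrivateKey.json
csv.filePath=./movie_content.csv
"""

config = load_properties(SAMPLE)
print(config["csv.filePath"])
```

Splitting on the first = keeps values that themselves contain = (for example, a URL with a query string) intact.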

3. Configure access to the Google Cloud Search data source

The first parameters every configuration file must specify are the ones necessary to access the Cloud Search data source, as shown in the following table. Typically, you need the data source ID, the service account ID, and the path to the service account's private key file to configure the connector's access to Cloud Search. The steps required to set up a data source are described in Manage third-party data sources.

Setting Parameter
Data source ID api.sourceId=1234567890abcdef

Required. The Google Cloud Search source ID set up by the Google Workspace administrator, as described in Manage third-party data sources.

Path to the service account private key file api.serviceAccountPrivateKeyFile=./PrivateKey.json

Required. The service account key file that the CSV connector uses to access Google Cloud Search.

Identity source ID api.identitySourceId=x0987654321

Required if using external users and groups. The Google Cloud Search identity source ID set up by the Google Workspace administrator.
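The required and conditional parameters above can be checked before starting the connector. The following Python sketch assumes the configuration has already been read into a dictionary. The function name and the uses_external_identities flag are hypothetical; the connector performs its own validation at startup.

```python
# Keys the table above marks as always required.
REQUIRED_KEYS = ("api.sourceId", "api.serviceAccountPrivateKeyFile")


def missing_access_keys(config, uses_external_identities=False):
    """Return the data source access keys that are absent or empty in config."""
    required = list(REQUIRED_KEYS)
    if uses_external_identities:
        # Needed only when external users and groups are used.
        required.append("api.identitySourceId")
    return [key for key in required if not config.get(key)]


example = {"api.sourceId": "1234567890abcdef"}
print(missing_access_keys(example, uses_external_identities=True))
```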

4. Configure CSV file parameters

Before the connector can traverse a CSV file and extract data from it for indexing, you must identify the path to the file. You can also specify the file format and type of file encoding. Add the following parameters to specify the CSV file properties in the configuration file.

Setting Parameter
Path to the CSV file csv.filePath=./movie_content.csv

Required. The path to the CSV file from which the connector extracts content for indexing.

File format csv.format=DEFAULT

The format of the file. Possible values are from the Apache Commons CSV CSVFormat class.

Format values include: DEFAULT, EXCEL, INFORMIX_UNLOAD, INFORMIX_UNLOAD_CSV, MYSQL, RFC4180, ORACLE, POSTGRESQL_CSV, POSTGRESQL_TEXT, and TDF. If unspecified, Cloud Search uses DEFAULT.

File format modifier csv.format.withMethod=value

A modification to how Cloud Search handles the file. Possible methods are from the Apache Commons CSV CSVFormat class and include those that take a single character, string, or boolean value.

For example, to specify a semicolon as a delimiter, use csv.format.withDelimiter=;. To ignore empty lines, use csv.format.withIgnoreEmptyLines=true.

File encoding type csv.fileEncoding=UTF-8

The Java character set to use when Cloud Search reads the file. If unspecified, Cloud Search uses the platform default character set.
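The following Python sketch shows the effect of the delimiter, empty-line, and encoding settings on how a file is read. It uses Python's csv module as a stand-in for Apache Commons CSV, so it is an analogy under that assumption rather than the connector's actual parsing code. read_rows is a hypothetical helper.

```python
import csv
import io


def read_rows(data, encoding="utf-8", delimiter=",", ignore_empty_lines=True):
    """Decode raw bytes and split them into rows, roughly mirroring
    csv.fileEncoding, csv.format.withDelimiter, and csv.format.withIgnoreEmptyLines."""
    text = data.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    rows = []
    for row in reader:
        if not row and ignore_empty_lines:
            continue  # Python's csv module yields [] for a blank line
        rows.append(row)
    return rows


raw = "movieId;movieTitle\n\n1;Gone with the Wind\n".encode("utf-8")
for row in read_rows(raw, delimiter=";"):
    print(row)
```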

5. Specify column names to index and unique key columns

For the connector to access and index CSV files, you must provide information about column definitions in the configuration file. If the configuration file does not contain the parameters that specify the column names to index and unique key columns, default values are used.

Setting Parameter
Columns to index csv.csvColumns=movieId,movieTitle,description,actors,releaseDate,year,userratings...

The column names to be indexed from the CSV file. If csv.csvColumns is not set, then the first row of the CSV file is used as the header. If csv.csvColumns is set, then it takes precedence over the first row of the CSV. If you have set csv.csvColumns and the first row of the CSV file is a list of column names, then you need to set csv.skipHeaderRecord=true to avoid trying to index the first row as data. Default values are the columns in the header row in the file.

Unique key columns csv.uniqueKeyColumns=movieId

The CSV column(s) whose values are used to generate each record's unique ID. If not specified, the hash of the CSV record is used as its unique key. The default value is the record's hashcode.
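The two cases (key columns configured, or hash fallback) can be sketched in Python as follows. The "||" separator and the use of SHA-256 are assumptions made for illustration; this guide does not specify how the connector combines key column values or computes the record hash.

```python
import hashlib


def record_key(record, unique_key_columns=()):
    """Derive an ID from the key columns, or from a hash of the whole record."""
    if unique_key_columns:
        # Assumed separator; the connector's actual format is not documented here.
        return "||".join(record[column] for column in unique_key_columns)
    # Fallback: a stable hash over all column/value pairs (assumed algorithm).
    canonical = repr(sorted(record.items())).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


movie = {"movieId": "42", "movieTitle": "Gone with the Wind"}
print(record_key(movie, ["movieId"]))
print(record_key(movie))
```

With the hash fallback, any change to a row produces a new ID, so a stable key column such as movieId is usually the better choice when one exists.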

6. Specify columns to use in clickable search result URLs

When a user searches using Google Cloud Search, it responds by showing a results page that includes a clickable URL for each result. To enable this feature, you must add the parameters shown in the following table to the configuration file.

Setting Parameter
Search result URL format url.format=https://mymoviesite.com/movies/{0}

Required. The format used to construct the view URL for CSV content.

Search results URL parameters url.columns=movieId

Required. The CSV column names whose values are used to generate the record's view URL.

Search results URL parameters to escape url.columnsToEscape=movieId

Optional. The CSV column names whose values are URL-escaped to generate a valid view URL.
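The following Python sketch shows how the three URL parameters fit together: the values of url.columns fill the numbered placeholders in url.format, and the columns listed in url.columnsToEscape are escaped first. It uses Python's str.format and urllib.parse.quote as stand-ins; the connector's exact escaping rules are not described in this guide and may differ (for example, in how spaces are encoded).

```python
from urllib.parse import quote


def build_view_url(url_format, record, columns, columns_to_escape=()):
    """Fill {0}, {1}, ... in url_format with the values of the given columns."""
    values = []
    for column in columns:
        value = record[column]
        if column in columns_to_escape:
            value = quote(value, safe="")  # percent-encode every reserved character
        values.append(value)
    return url_format.format(*values)


row = {"movieId": "a b/1", "movieTitle": "Gone with the Wind"}
print(build_view_url("https://mymoviesite.com/movies/{0}", row, ["movieId"], ["movieId"]))
```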

7. Specify metadata information, column formats, search quality

You can add parameters to the configuration file that specify:

  • Metadata configuration parameters
  • Datetime formats
  • Column formats
  • Search quality

Metadata Configuration Parameters

Metadata configuration parameters describe the CSV columns used to populate item metadata. If the configuration file does not contain these parameters, default values are used. The following table shows these parameters.

Setting Parameter
Title itemMetadata.title.field=movieTitle
itemMetadata.title.defaultValue=Gone with the Wind

The metadata attribute that contains the value corresponding to the document title. The default value is an empty string.

URL itemMetadata.sourceRepositoryUrl.field=url
itemMetadata.sourceRepositoryUrl.defaultValue=https://www.imdb.com/title/tt0031381/

The metadata attribute that contains the value for the document URL for search results.

Created timestamp itemMetadata.createTime.field=releaseDate
itemMetadata.createTime.defaultValue=1940-01-17

The metadata attribute that contains the value for the document creation timestamp.

Last modified time itemMetadata.updateTime.field=releaseDate
itemMetadata.updateTime.defaultValue=1940-01-17

The metadata attribute that contains the value for the last modification timestamp for the document.

Document language itemMetadata.contentLanguage.field=languageCode
itemMetadata.contentLanguage.defaultValue=en-US

The content language for documents being indexed.

Schema object type itemMetadata.objectType.field=type
itemMetadata.objectType.defaultValue=movie

The object type used by the connector, as defined in the schema. The connector won't index any structured data if this property is not specified.

Datetime formats

Datetime formats specify the formats expected in metadata attributes. If the configuration file does not contain this parameter, default values are used. The following table shows this parameter.

Setting Parameter
Additional datetime formats structuredData.dateTimePatterns=MM/dd/uuuu HH:mm:ssXXX

A semicolon-separated list of additional java.time.format.DateTimeFormatter patterns. The patterns are used when parsing string values for any date or date-time fields in the metadata or schema. The default value is an empty list, but RFC 3339 and RFC 1123 formats are always supported.
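The following Python sketch illustrates the idea of trying a built-in format first and then each additional pattern in turn. The strptime pattern "%m/%d/%Y %H:%M:%S%z" is only a rough Python analogue of the Java pattern MM/dd/uuuu HH:mm:ssXXX, and the order in which the connector tries patterns is an assumption. The sketch requires Python 3.7 or later for offsets written with a colon.

```python
from datetime import datetime

# A common RFC 3339 shape, standing in for the formats that are always supported.
RFC3339 = "%Y-%m-%dT%H:%M:%S%z"
# Rough strptime analogue of the Java pattern MM/dd/uuuu HH:mm:ssXXX.
EXTRA_PATTERNS = ["%m/%d/%Y %H:%M:%S%z"]


def parse_datetime(value, extra_patterns=EXTRA_PATTERNS):
    """Try the built-in pattern first, then each additional pattern in order."""
    for pattern in [RFC3339, *extra_patterns]:
        try:
            return datetime.strptime(value, pattern)
        except ValueError:
            continue  # this pattern does not match; try the next one
    raise ValueError(f"no configured pattern matches {value!r}")


print(parse_datetime("01/17/1940 08:30:00-05:00").isoformat())
```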

Column formats

Column formats specify information about the column(s) that should be a part of the searchable content. If the configuration file does not contain these parameters, default values are used. The following table shows these parameters.

Setting Parameter
Skip header csv.skipHeaderRecord=true

Boolean. Ignore the header record (first line) in the CSV file. If you have set csv.csvColumns and the CSV file has a header row, then you must set csv.skipHeaderRecord=true. This prevents indexing the first row in the file as data. If the CSV file does not have a header row, set csv.skipHeaderRecord=false. The default value is false.

Multi-value columns csv.multiValueColumns=genre,actors

The column names in the CSV file that have multiple values. The default value is an empty string.

Delimiter for multi-value columns csv.multiValue.genre=;

The delimiter for the named multi-value column (genre in this example). The default delimiter is a comma.
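The following Python sketch shows how the multi-value settings might be applied to a single record: columns listed in csv.multiValueColumns are split, using a per-column delimiter when one is configured and a comma otherwise. Trimming whitespace and dropping empty values are assumptions, not documented connector behavior.

```python
def split_multi_values(record, multi_value_columns, delimiters=None):
    """Return a copy of record with the multi-value columns split into lists."""
    delimiters = delimiters or {}
    result = dict(record)
    for column in multi_value_columns:
        delimiter = delimiters.get(column, ",")  # comma is the documented default
        parts = record.get(column, "").split(delimiter)
        result[column] = [part.strip() for part in parts if part.strip()]
    return result


row = {"movieTitle": "Gone with the Wind",
       "genre": "Drama;Romance",
       "actors": "Clark Gable, Vivien Leigh"}
print(split_multi_values(row, ["genre", "actors"], {"genre": ";"}))
```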

Search quality

The Cloud Search CSV connector allows automatic HTML formatting for data fields. Your connector defines the data fields at the beginning of connector execution, and then uses a content template to format each data record before uploading it to Cloud Search.

The content template defines the importance of each field value for searching. The title field is required and is defined as the highest priority. You can designate search quality importance levels for all the other content fields: high, medium or low. Any content field not defined in a specific category defaults to low priority. The following table shows these parameters.

Setting Parameter
Content title contentTemplate.csv.title=movieTitle

The content title is the highest search quality field.

High search quality for content fields contentTemplate.csv.quality.high=actors

Content fields given a high search quality value. The default is an empty string.

Low search quality for content fields contentTemplate.csv.quality.low=genre

Content fields given a low search quality value. The default is an empty string.

Medium search quality for content fields contentTemplate.csv.quality.medium=description

Content fields given a medium search quality value. The default is an empty string.

Unspecified content fields contentTemplate.csv.unmappedColumnsMode=IGNORE

How the connector handles unspecified content fields. Valid values are:

  • APPEND—append unspecified content fields to the template
  • IGNORE—ignore unspecified content fields

    The default value is APPEND.
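The following Python sketch shows how the search quality parameters might determine which fields appear in the content template, and in what order. It models field ordering only; the HTML that the connector generates is not shown in this guide, and order_content_fields is a hypothetical helper.

```python
def order_content_fields(columns, title, high=(), medium=(), low=(),
                         unmapped_mode="APPEND"):
    """List content fields from highest to lowest search quality."""
    mapped = [title, *high, *medium, *low]
    ordered = list(mapped)
    if unmapped_mode == "APPEND":
        # Fields not assigned to any category are added at the end (low priority).
        ordered += [column for column in columns if column not in mapped]
    elif unmapped_mode != "IGNORE":
        raise ValueError(f"unknown unmappedColumnsMode: {unmapped_mode}")
    return ordered


columns = ["movieId", "movieTitle", "description", "genre", "actors"]
print(order_content_fields(columns, "movieTitle", high=["actors"],
                           medium=["description"], low=["genre"],
                           unmapped_mode="IGNORE"))
```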

8. Schedule data traversal

Traversal is the connector's process for discovering content from the data source, in this case, a CSV file. As the CSV connector runs, it will traverse the rows of a CSV file, and index each row to Cloud Search via the Indexing API.

A full traversal indexes all rows in the file. An incremental traversal indexes only rows that were added or modified since the previous traversal. The CSV connector performs only full traversals; it does not perform incremental traversals.

The scheduling parameters determine how long the connector waits between traversals. If the configuration file does not contain scheduling parameters, default values are used. The following table shows these parameters.

Setting Parameter
Full traversal after an interval schedule.traversalIntervalSecs=7200

The connector performs a full traversal after a specified interval. Specify the interval between traversals in seconds. The default value is 86400 (number of seconds in one day).

Full traversal at connector startup schedule.performTraversalOnStart=false

The connector performs a full traversal at connector startup, rather than waiting for the first interval to expire. The default value is true.
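The interaction of the two scheduling parameters can be sketched in Python by computing when each traversal would start, in seconds after connector startup. The sketch assumes the interval is measured from one traversal start to the next; this guide does not say whether the connector measures it from the start or the end of a traversal.

```python
def traversal_start_offsets(interval_secs=86400, perform_on_start=True, count=3):
    """Return the start times (seconds after startup) of the first `count` traversals."""
    # Without a traversal at startup, the first one waits for a full interval.
    first = 0 if perform_on_start else interval_secs
    return [first + i * interval_secs for i in range(count)]


# schedule.traversalIntervalSecs=7200, schedule.performTraversalOnStart=false
print(traversal_start_offsets(7200, perform_on_start=False))
# Defaults: one traversal at startup, then one per day.
print(traversal_start_offsets())
```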

9. Specify Access Control List (ACL) options

The Google Cloud Search CSV connector supports permissions through ACLs to control access to the content of the CSV file in search results. Multiple ACL options are available to let you restrict user access to indexed records.

If your repository has individual ACL information associated with each document, upload all ACL information to control document access within Cloud Search. If your repository provides partial or no ACL information, you can supply default ACL information in the following parameters, which the SDK provides to the connector.

The connector relies on default ACLs being enabled in the configuration file. To enable default ACLs, set defaultAcl.mode to any mode other than none and configure it with the defaultAcl.* parameters.

Setting Parameter
ACL mode defaultAcl.mode=fallback

Required. The CSV connector relies on default ACL functionality and supports only fallback mode.

Default ACL Name defaultAcl.name=VIRTUAL_CONTAINER_FOR_CONNECTOR_1

Optional. Overrides the virtual container name that the connector uses to set up default ACLs. The default value is "DEFAULT_ACL_VIRTUAL_CONTAINER". You may want to override this value if multiple connectors are indexing content in the same data source.

Default public ACL defaultAcl.public=true

The default ACL used for the entire repository is set to public domain access. The default value is false.

Common ACL group readers defaultAcl.readers.groups=google:group1, group2

Common ACL readers defaultAcl.readers.users=user1, user2, google:user3

Common ACL denied group readers defaultAcl.denied.groups=group3

Common ACL denied readers defaultAcl.denied.users=user4, user5

Entire domain access

To specify that every indexed record be publicly accessible by every user in the domain, set both of the following options:

  • defaultAcl.mode=fallback
  • defaultAcl.public=true

Common defined ACL

To specify one ACL for each record of the data repository, set all of the following parameter values:

  • defaultAcl.mode=fallback
  • defaultAcl.public=false
  • defaultAcl.readers.groups=google:group1, group2
  • defaultAcl.readers.users=user1, user2, google:user3
  • defaultAcl.denied.groups=group3
  • defaultAcl.denied.users=user4, user5

    Every specified user and group is assumed to be a local domain-defined user/group unless prefixed with "google:" (literal constant).

    The default user or group is an empty string. Supply user and group options only if defaultAcl.public is set to false. To list multiple groups and users, use a comma-delimited list.

    If defaultAcl.mode is set to none, records are unsearchable without defined individual ACLs.
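The following Python sketch shows how a comma-delimited reader list could be split into individual principals, treating names with the literal "google:" prefix as Google principals and all others as local domain-defined ones. The ("google" or "local", name) tuple layout is a hypothetical representation chosen for this example, not a connector data structure.

```python
GOOGLE_PREFIX = "google:"


def parse_principals(value):
    """Split a comma-delimited list into (kind, name) pairs."""
    principals = []
    for item in value.split(","):
        name = item.strip()
        if not name:
            continue  # ignore empty entries, such as a trailing comma
        if name.startswith(GOOGLE_PREFIX):
            principals.append(("google", name[len(GOOGLE_PREFIX):]))
        else:
            principals.append(("local", name))
    return principals


print(parse_principals("user1, user2, google:user3"))
```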

Schema Definition

Cloud Search allows indexing and serving of structured and unstructured content. To support structured data queries on your data, you need to set up a schema for your data source.

Once the schema is defined, the CSV connector can refer to it to build indexing requests. As an illustrative example, consider a CSV file containing information about movies.

Assume the input CSV file has the following columns:

  1. movieId
  2. movieTitle
  3. description
  4. year
  5. releaseDate
  6. actors (multiple values separated by comma (,))
  7. genre (multiple values)
  8. ratings

Based on this data structure, you can define a schema for the data source under which you want to index data from the CSV file.

{
  "objectDefinitions": [
    {
      "name": "movie",
      "propertyDefinitions": [
        {
          "name": "actors",
          "isReturnable": true,
          "isRepeatable": true,
          "isFacetable": true,
          "textPropertyOptions": {
            "operatorOptions": {
              "operatorName": "actor"
            }
          }
        },
        {
          "name": "releaseDate",
          "isReturnable": true,
          "isRepeatable": false,
          "isFacetable": false,
          "datePropertyOptions": {
            "operatorOptions": {
              "operatorName": "released",
              "lessThanOperatorName": "releasedbefore",
              "greaterThanOperatorName": "releasedafter"
            }
          }
        },
        {
          "name": "movieTitle",
          "isReturnable": true,
          "isRepeatable": false,
          "isFacetable": false,
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGHEST"
            },
            "operatorOptions": {
              "operatorName": "title"
            }
          }
        },
        {
          "name": "genre",
          "isReturnable": true,
          "isRepeatable": true,
          "isFacetable": true,
          "enumPropertyOptions": {
            "operatorOptions": {
              "operatorName": "genre"
            },
            "possibleValues": [
              {
                "stringValue": "Action"
              },
              {
                "stringValue": "Documentary"
              },
              {
                "stringValue": "Drama"
              },
              {
                "stringValue": "Crime"
              },
              {
                "stringValue": "Sci-fi"
              }
            ]
          }
        },
        {
          "name": "userRating",
          "isReturnable": true,
          "isRepeatable": false,
          "isFacetable": true,
          "integerPropertyOptions": {
            "orderedRanking": "ASCENDING",
            "maximumValue": "10",
            "operatorOptions": {
              "operatorName": "score",
              "lessThanOperatorName": "scorebelow",
              "greaterThanOperatorName": "scoreabove"
            }
          }
        }
      ]
    }
  ]
}
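A quick way to compare a schema with a CSV layout is to list the columns that have no schema property of the same name. The following Python sketch assumes that structured data properties are matched to CSV columns by name, which this guide does not state explicitly. It uses an abbreviated copy of the schema above that keeps only property names and repeatability.

```python
import json

# Abbreviated copy of the movie schema: property names and repeatability only.
SCHEMA = json.loads("""
{"objectDefinitions": [{"name": "movie", "propertyDefinitions": [
  {"name": "actors", "isRepeatable": true},
  {"name": "releaseDate", "isRepeatable": false},
  {"name": "movieTitle", "isRepeatable": false},
  {"name": "genre", "isRepeatable": true},
  {"name": "userRating", "isRepeatable": false}
]}]}
""")


def property_names(schema, object_type):
    """Return the property names defined for one object type."""
    for definition in schema["objectDefinitions"]:
        if definition["name"] == object_type:
            return [prop["name"] for prop in definition["propertyDefinitions"]]
    raise KeyError(object_type)


def columns_without_property(columns, schema, object_type):
    """List CSV columns that have no same-named property in the schema."""
    names = set(property_names(schema, object_type))
    return [column for column in columns if column not in names]


csv_columns = ["movieId", "movieTitle", "description", "year",
               "releaseDate", "actors", "genre", "ratings"]
print(columns_without_property(csv_columns, SCHEMA, "movie"))
```

Columns reported here are not necessarily errors, because they can still be indexed as unstructured content. A near-miss such as ratings versus userRating is worth checking.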

Example configuration file

The following example configuration file shows the parameter key=value pairs that define an example connector's behavior.

# data source access
api.sourceId=1234567890abcd
api.serviceAccountPrivateKeyFile=./PrivateKey.json

# CSV data structure
csv.filePath=./movie_content.csv
csv.csvColumns=movieId,movieTitle,description,releaseYear,genre,actors,ratings,releaseDate
csv.skipHeaderRecord=true
url.format=https://mymoviesite.com/movies/{0}
url.columns=movieId
csv.datetimeFormat.releaseDate=yyyy-mm-dd
csv.multiValueColumns=genre,actors
csv.multiValue.genre=;
contentTemplate.csv.title=movieTitle

# metadata structured data and content
itemMetadata.title.field=movieTitle
itemMetadata.createTime.field=releaseDate
itemMetadata.contentLanguage.defaultValue=en-US
itemMetadata.objectType.defaultValue=movie
contentTemplate.csv.quality.medium=description
contentTemplate.csv.unmappedColumnsMode=IGNORE

#ACLs
defaultAcl.mode=fallback
defaultAcl.public=true

For detailed descriptions of each parameter, see the Configuration parameters reference.

Run the Cloud Search CSV connector

To run the connector from the command line, type the following command:

$ java -jar google-cloudsearch-csv-connector-v1-0.0.3.jar -Dconfig=my.config

By default, connector logs are available on standard output. You can log to files by specifying logging.properties.