Query open table formats with manifests

This document describes how to use manifest files to query data stored in open table formats such as Apache Hudi and Delta Lake.

Some open table formats such as Hudi and Delta Lake export their current state as one or more manifest files. A manifest file contains a list of data files that make tables. With the manifest support in BigQuery, you can query and load data stored in open table formats.

Before you begin

Required roles

To query BigLake tables based on Hudi and Delta Lake data, ensure you have the following roles:

  • BigQuery Connection User (roles/bigquery.connectionUser)
  • BigQuery Data Viewer (roles/bigquery.dataViewer)
  • BigQuery User (roles/bigquery.user)

You can also query Hudi external tables. However, we recommend you to upgrade the external table to BigLake. To query Hudi external tables, ensure you have the following roles:

  • BigQuery Data Viewer (roles/bigquery.dataViewer)
  • BigQuery User (roles/bigquery.user)
  • Storage Object Viewer (roles/storage.objectViewer)

Depending on your permissions, you can grant these roles to yourself or ask your administrator to grant them to you. For more information about granting roles, see Viewing the grantable roles on resources.

To see the exact permissions that are required to query BigLake tables, expand the Required permissions section:

Required permissions

You might also be able to get these permissions with custom roles or other predefined roles.

Query Hudi workloads

To query Hudi data, follow these steps:

  1. Create an external table based on Hudi data.
  2. Upgrade the external table to BigLake.

Create Hudi external tables

When you sync tables by using the sync tool for Hudi and BigQuery, enable the use-bq-manifest-file flag to transition to the manifest file approach. This flag also exports a manifest file in a format supported by BigQuery and uses it to create an external table with the name specified in the --table parameter.

To create a Hudi external table, follow these steps:

  1. To create a Hudi external table, submit a job to an existing Dataproc cluster. When you build the Hudi-BigQuery connector, enable the use-bq-manifest-file flag to transition to the manifest file approach. This flag exports a manifest file in a format supported by BigQuery and uses it to create an external table with the name specified in the --table parameter.

    spark-submit \
       --master yarn \
       --packages com.google.cloud:google-cloud-bigquery:2.10.4 \
       --class org.apache.hudi.gcp.bigquery.BigQuerySyncTool  \
       JAR \
       --project-id PROJECT_ID \
       --dataset-name DATASET \
       --dataset-location LOCATION \
       --table TABLE \
       --source-uri URI  \
       --source-uri-prefix URI_PREFIX \
       --base-path BASE_PATH  \
       --partitioned-by PARTITION_BY \
       --use-bq-manifest-file
    

    Replace the following:

    • JAR: If you are using the Hudi-BigQuery connector, specify hudi-gcp-bundle-0.14.0.jar. If you are using the Hudi component in Dataproc 2.1, specify /usr/lib/hudi/tools/bq-sync-tool/hudi-gcp-bundle-0.12.3.1.jar

    • PROJECT_ID: the project ID in which you want to create the Hudi BigLake table

    • DATASET: the dataset in which you want to create the Hudi BigLake table

    • LOCATION: the location in which you want to create the Hudi BigLake table

    • TABLE: the name of the table that you want to create

      If you are transitioning from the earlier version of the Hudi-BigQuery connector (0.13.0 and earlier) that created views on the manifest files, ensure that you use the same table name as it lets you keep the existing downstream pipeline code.

    • URI: the Cloud Storage URI that you created to store the Hudi manifest file

      This URI points to the first level partition; make sure to include the partition key. For example, gs://mybucket/hudi/mydataset/EventDate=*

    • URI_PREFIX: the prefix for the Cloud Storage URI path, usually it's the path to Hudi tables

    • BASE_PATH: the base path for Hudi tables

      For example, gs://mybucket/hudi/mydataset/

    • PARTITION_BY: the partition value

      For example, EventDate

    For more information about the connector's configuration, see Hudi-BigQuery connector.

  2. To set appropriate fine-grained controls or to accelerate the performance by enabling metadata caching, see Upgrade BigLake tables.

Query Delta workloads

To query Delta workloads, follow these steps:

  1. Generate a manifest file.
  2. Create a BigLake table based on the manifest file.
  3. Set appropriate fine-grained controls or accelerate the performance by enabling metadata caching. To do this, see Upgrade BigLake tables.

Generate a manifest file

BigQuery supports the manifest file in a SymLinkTextInputFormat format, which is a newline-delimited list of URIs. For more information about generating a manifest file, see Set up Presto to Delta Lake integration and query Delta tables.

To generate a manifest file, submit a job to an existing Dataproc cluster:

SQL

Using Spark, run the following command on a Delta table at location path-to-delta-table:

GENERATE symlink_format_manifest FOR TABLE delta.`<path-to-delta-table>`

Scala

Using Spark, run the following command on a Delta table at location path-to-delta-table:

val deltaTable = DeltaTable.forPath(<path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")

Java

Using Spark, run the following command on a Delta table at location path-to-delta-table:

DeltaTable deltaTable = DeltaTable.forPath(<path-to-delta-table>);
deltaTable.generate("symlink_format_manifest");

Python

Using Spark, run the following command on a Delta table at location path-to-delta-table:

deltaTable = DeltaTable.forPath(<path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")

Create Delta BigLake tables

To create a Delta BigLake table, use the CREATE EXTERNAL TABLE statement with the file_set_spec_type field set to NEW_LINE_DELIMITED_MANIFEST:

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, run the CREATE EXTERNAL TABLE statement:

    CREATE EXTERNAL TABLE PROJECT_ID.DATASET_NAME.TABLE_NAME
    WITH PARTITION COLUMNS(
    `PARTITION_COLUMN PARTITION_COLUMN_TYPE`,)
    WITH CONNECTION `PROJECT_IDREGION.CONNECTION_NAME`
    OPTIONS (
       format = "DATA_FORMAT",
       uris = ["URI"],
       file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',
       hive_partition_uri_prefix = "PATH_TO_DELTA_TABLE"
       max_staleness = STALENESS_INTERVAL,
       metadata_cache_mode = 'CACHE_MODE');
    

    Replace the following:

    • DATASET_NAME: the name of the dataset you created
    • TABLE_NAME: the name you want to give to this table
    • REGION: the location where the connection is located (for example, us-east1)
    • CONNECTION_NAME: the name of the connection you created
    • DATA_FORMAT: any of the supported formats (such as PARQUET)
    • URI: the path to the manifest file (for example, gs://mybucket/path)
    • PATH_TO_DELTA_TABLE: a common prefix for all source URIs before the partition key encoding begins
    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the BigLake table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Delta Lake instead.

    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information about metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

    Example:

    CREATE EXTERNAL TABLE mydataset.mytable
    WITH CONNECTION `us-east1.myconnection`
    OPTIONS (
        format="PARQUET",
        uris=["gs://mybucket/path/partitionpath=*"],
        file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST'
        hive_partition_uri_prefix = "gs://mybucket/path/"
        max_staleness = INTERVAL 1 DAY,
        metadata_cache_mode = 'AUTOMATIC'
    );
    

Upgrade BigLake tables

You can also accelerate the performance of your workloads by taking advantage of metadata caching and materialized views. If you want to use metadata caching, you can specify settings for this at the same time. To get table details such as source format and source URI, see Get table information.

To update an external table to a BigLake table or update an existing BigLake, select one of the following options:

SQL

Use the CREATE OR REPLACE EXTERNAL TABLE DDL statement to update a table:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In the query editor, enter the following statement:

    CREATE OR REPLACE EXTERNAL TABLE
      `PROJECT_ID.DATASET.EXTERNAL_TABLE_NAME`
      WITH CONNECTION `REGION.CONNECTION_ID`
      OPTIONS(
        format ="TABLE_FORMAT",
        uris = ['BUCKET_PATH'],
        max_staleness = STALENESS_INTERVAL,
        metadata_cache_mode = 'CACHE_MODE'
        );
    

    Replace the following:

    • PROJECT_ID: the name of the project that contains the table
    • DATASET: the name of the dataset that contains the table
    • EXTERNAL_TABLE_NAME: the name of the table
    • REGION: the region that contains the connection
    • CONNECTION_ID: the name of the connection to use
    • TABLE_FORMAT: the format used by the table

      You can't change this when updating the table.

    • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format ['gs://bucket_name/[folder_name/]file_name'].

      You can select multiple files from the bucket by specifying one asterisk (*) wildcard character in the path. For example, ['gs://mybucket/file_name*']. For more information, see Wildcard support for Cloud Storage URIs.

      You can specify multiple buckets for the uris option by providing multiple paths.

      The following examples show valid uris values:

      • ['gs://bucket/path1/myfile.csv']
      • ['gs://bucket/path1/*.csv']
      • ['gs://bucket/path1/*', 'gs://bucket/path2/file00*']

      When you specify uris values that target multiple files, all of those files must share a compatible schema.

      For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it

      For more information about metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval literal value between 30 minutes and 7 days. For example, specify INTERVAL 4 HOUR for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.

    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually

      For more information on metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

  3. Click Run.

For more information about how to run queries, see Run an interactive query.

bq

Use the bq mkdef and bq update commands to update a table:

  1. Generate an external table definition, that describes the aspects of the table to change:

    bq mkdef --connection_id=PROJECT_ID.REGION.CONNECTION_ID \
    --source_format=TABLE_FORMAT \
    --metadata_cache_mode=CACHE_MODE \
    "BUCKET_PATH" > /tmp/DEFINITION_FILE
    

    Replace the following:

    • PROJECT_ID: the name of the project that contains the connection
    • REGION: the region that contains the connection
    • CONNECTION_ID: the name of the connection to use
    • TABLE_FORMAT: the format used by the table. You can't change this when updating the table.
    • CACHE_MODE: specifies whether the metadata cache is refreshed automatically or manually. For more information on metadata caching considerations, see Metadata caching for performance.

      Set to AUTOMATIC for the metadata cache to be refreshed at a system-defined interval, usually somewhere between 30 and 60 minutes.

      Set to MANUAL if you want to refresh the metadata cache on a schedule you determine. In this case, you can call the BQ.REFRESH_EXTERNAL_METADATA_CACHE system procedure to refresh the cache.

      You must set CACHE_MODE if STALENESS_INTERVAL is set to a value greater than 0.

    • BUCKET_PATH: the path to the Cloud Storage bucket that contains the data for the external table, in the format gs://bucket_name/[folder_name/]file_name.

      You can limit the files selected from the bucket by specifying one asterisk (*) wildcard character in the path. For example, gs://mybucket/file_name*. For more information, see Wildcard support for Cloud Storage URIs.

      You can specify multiple buckets for the uris option by providing multiple paths.

      The following examples show valid uris values:

      • gs://bucket/path1/myfile.csv
      • gs://bucket/path1/*.csv
      • gs://bucket/path1/*,gs://bucket/path2/file00*

      When you specify uris values that target multiple files, all of those files must share a compatible schema.

      For more information about using Cloud Storage URIs in BigQuery, see Cloud Storage resource path.

    • DEFINITION_FILE: the name of the table definition file that you are creating.

  2. Update the table using the new external table definition:

    bq update --max_staleness=STALENESS_INTERVAL \
    --external_table_definition=/tmp/DEFINITION_FILE \
    PROJECT_ID:DATASET.EXTERNAL_TABLE_NAME
    

    Replace the following:

    • STALENESS_INTERVAL: specifies whether cached metadata is used by operations against the table, and how fresh the cached metadata must be in order for the operation to use it. For more information about metadata caching considerations, see Metadata caching for performance.

      To disable metadata caching, specify 0. This is the default.

      To enable metadata caching, specify an interval value between 30 minutes and 7 days, using the Y-M D H:M:S format described in the INTERVAL data type documentation. For example, specify 0-0 0 4:0:0 for a 4 hour staleness interval. With this value, operations against the table use cached metadata if it has been refreshed within the past 4 hours. If the cached metadata is older than that, the operation retrieves metadata from Cloud Storage instead.

    • DEFINITION_FILE: the name of the table definition file that you created or updated.

    • PROJECT_ID: the name of the project that contains the table

    • DATASET: the name of the dataset that contains the table

    • EXTERNAL_TABLE_NAME: the name of the table

Query BigLake and external tables

After creating a BigLake table, you can query it using GoogleSQL syntax, the same as if it were a standard BigQuery table. For example, SELECT field1, field2 FROM mydataset.my_cloud_storage_table;.

Limitations

  • BigQuery only supports querying Delta Lake reader v1 tables.

  • Hudi and BigQuery integration only works for hive-style partitioned copy-on-write tables.

  • Using manifests to query data stored in third-party storage is not supported.

What's next