The ML.PROCESS_DOCUMENT function

This document describes the ML.PROCESS_DOCUMENT function, which lets you process unstructured documents from an object table by using the Document AI API.

Syntax

ML.PROCESS_DOCUMENT(
  MODEL `project_id.dataset.model_name`,
  TABLE `project_id.dataset.object_table`
)

Arguments

ML.PROCESS_DOCUMENT takes the following arguments:

  • project_id: Your project ID.

  • dataset: The BigQuery dataset that contains the model.

  • model_name: The name of a remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_DOCUMENT_V1.

  • object_table: The name of the object table that contains URIs of the documents.

    The documents in the object table must be of a supported type. An error is returned for any row that contains a document of an unsupported type.

Output

ML.PROCESS_DOCUMENT returns the following columns:

  • ml_process_document_result: a JSON value that contains the entities returned by the Document AI API.
  • ml_process_document_status: a STRING value that contains the API response status for the corresponding row. This value is empty if the operation was successful.
  • The fields returned by the processor specified in the model.
  • The object table columns.
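
As a minimal sketch of how these columns might be used together, the following query keeps only the rows that were processed successfully. The model and table names are placeholders, uri is a standard object table column, and the IFNULL guard is an assumption in case a successful status is returned as NULL rather than as an empty string.

```sql
# Sketch: return the parsed JSON only for documents that were processed
# without an error. All project, dataset, model, and table names are placeholders.
SELECT
  uri,
  ml_process_document_result
FROM ML.PROCESS_DOCUMENT(
  MODEL `myproject.mydataset.invoice_parser`,
  TABLE `myproject.mydataset.documents`
)
# An empty status means success; IFNULL is a defensive assumption.
WHERE IFNULL(ml_process_document_status, '') = '';
```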

Quotas

See Cloud AI service functions quotas and limits.

Known issues

Sometimes after a query job that uses this function finishes successfully, some returned rows contain the following error message:

A retryable error occurred: RESOURCE EXHAUSTED error from <remote endpoint>

This issue occurs because BigQuery query jobs finish successfully even if the function fails for some of the rows. The function fails when the volume of API calls to the remote endpoint exceeds the quota limits for that service. This issue occurs most often when you are running multiple parallel batch queries. BigQuery retries these calls, but if the retries fail, the resource exhausted error message is returned.
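
One way to find the affected rows is to save the function output to a table and then filter on the error prefix shown above. This is a sketch, not an official retry procedure: the document_results table name is hypothetical, and the model and object table names are placeholders.

```sql
# Save the output once so that later checks don't reprocess the documents.
# `document_results` is a hypothetical table name.
CREATE OR REPLACE TABLE `myproject.mydataset.document_results` AS
SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `myproject.mydataset.invoice_parser`,
  TABLE `myproject.mydataset.documents`
);

# List the documents whose status starts with the retryable error prefix.
SELECT
  uri,
  ml_process_document_status
FROM `myproject.mydataset.document_results`
WHERE STARTS_WITH(ml_process_document_status, 'A retryable error occurred');
```

The URIs returned by the second query identify the documents that would need to be processed again, for example after parallel batch queries have finished and quota is available.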

Locations

ML.PROCESS_DOCUMENT must run in the same region as the remote model that the function references. You can only create models based on Document AI in the US and EU multi-regions.

Limitations

The function can't process documents with more than 15 pages. Any row that contains such a file returns an error.

Example

The following example uses the invoice parser to process the documents represented by the documents object table.

Create the model:

# Create a remote model that points to a Document AI processor version.
CREATE OR REPLACE MODEL `myproject.mydataset.invoice_parser`
  REMOTE WITH CONNECTION `myproject.myregion.myconnection`
  OPTIONS (
    remote_service_type = 'cloud_ai_document_v1',
    document_processor = 'projects/project_number/locations/processor_location/processors/processor_id/processorVersions/version_id'
  );

Process the documents:

SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL `myproject.mydataset.invoice_parser`,
  TABLE `myproject.mydataset.documents`
);

The result is similar to the following:

ml_process_document_result ml_process_document_status invoice_type currency ...
{"entities":[{"confidence":1,"id":"0","mentionText":"10 105,93 10,59","pageAnchor":{"pageRefs":[{"boundingPoly":{"normalizedVertices":[{"x":0.40452111,"y":0.67199326},{"x":0.74776918,"y":0.67199326},{"x":0.74776918,"y":0.68208581},{"x":0.40452111,"y":0.68208581}]}}]},"properties":[{"confidence":0.66... USD
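
Because ml_process_document_result is a JSON value, individual entities can be pulled out with the standard BigQuery JSON functions. The following is a sketch that assumes the result has the shape shown above, with an entities array whose elements carry mentionText and confidence fields. The output column names mention_text and confidence are illustrative.

```sql
# Sketch: flatten the entities array into one row per entity.
WITH processed AS (
  SELECT *
  FROM ML.PROCESS_DOCUMENT(
    MODEL `myproject.mydataset.invoice_parser`,
    TABLE `myproject.mydataset.documents`
  )
)
SELECT
  uri,
  JSON_VALUE(entity, '$.mentionText') AS mention_text,
  # JSON_VALUE returns a STRING, so convert it; SAFE_CAST yields NULL on bad input.
  SAFE_CAST(JSON_VALUE(entity, '$.confidence') AS FLOAT64) AS confidence
FROM processed,
  UNNEST(JSON_QUERY_ARRAY(ml_process_document_result, '$.entities')) AS entity;
```

Rows whose result has no entities array produce no output rows here, because the comma join with UNNEST behaves like an inner join.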

What's next