LLM Inference guide

The LLM Inference API lets you run large language models (LLMs) completely on-device, which you can use to perform a wide range of tasks, such as generating text, retrieving information in natural language form, and summarizing documents. The task provides built-in support for multiple text-to-text large language models, so you can apply the latest on-device generative AI models to your apps and products.

Try it!

The task supports Gemma 2B and 7B, a part of a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. It also supports the following external models: Phi-2, Falcon-RW-1B and StableLM-3B.

In addition to the models supported natively, users can map other models using Google's AI Edge offerings (including mapping PyTorch models). This allows users to export a mapped model into multi-signature TensorFlow Lite models, which are bundled with tokenizer parameters to create a Task Bundle.

Get Started

Start using this task by following one of these implementation guides for your target platform. These platform-specific guides walk you through a basic implementation of this task, with code examples that use an available model and the recommended configuration options:

Task details

This section describes the capabilities, inputs, outputs, and configuration options of this task.

Features

The LLM Inference API contains the following key features:

  1. Text-to-text generation - Generate text based on an input text prompt.
  2. LLM selection - Apply multiple models to tailor the app for your specific use cases. You can also retrain and apply customized weights to the model.
  3. LoRA support - Extend and customize the LLM capability with LoRA model either by training on your all dataset, or taking prepared prebuilt LoRA models from the open-source community (native models only).
Task inputs Task outputs
The LLM Inference API accepts the following inputs:
  • Text prompt (e.g., a question, an email subject, a document to be summarized)
The LLM Inference API outputs the following results:
  • Generated text based on the input prompt (e.g., an answer to the question, an email draft, a summary of the document)

Configurations options

This task has the following configuration options:

Option Name Description Value Range Default Value
modelPath The path to where the model is stored within the project directory. PATH N/A
maxTokens The maximum number of tokens (input tokens + output tokens) the model handles. Integer 512
topK The number of tokens the model considers at each step of generation. Limits predictions to the top k most-probable tokens. When setting topK, you must also set a value for randomSeed. Integer 40
temperature The amount of randomness introduced during generation. A higher temperature results in more creativity in the generated text, while a lower temperature produces more predictable generation. When setting temperature, you must also set a value for randomSeed. Float 0.8
randomSeed The random seed used during text generation. Integer 0
loraPath The absolute path to the LoRA model locally on the device. Note: this is only compatible with GPU models. PATH N/A
resultListener Sets the result listener to receive the results asynchronously. Only applicable when using the async generation method. N/A N/A
errorListener Sets an optional error listener. N/A N/A

Models

The LLM Inference API contains built-in support for severable text-to-text large language models that are optimized to run on browsers and mobile devices. These lightweight models can be downloaded to run inferences completely on-device.

Before initializing the LLM Inference API, download one of the supported models and store the file within your project directory.

Gemma 2B

Gemma 2B is a part of a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. The model contains 2B parameters and open weights. This model is well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning.

Download Gemma 2B

The Gemma 2B models come in four variants:

You can also tune the model and add new weights before adding it to the app. For more information on tuning and customizing Gemma, see Tuning Gemma. After downloading Gemma from Kaggle Models, the model is already in the appropriate format to use with MediaPipe.

If you download Gemma 2B from Hugging Face, you must convert the model to a MediaPipe-friendly format. The LLM Inference API requires the following files to be downloaded and converted:

  • model-00001-of-00002.safetensors
  • model-00002-of-00002.safetensors
  • tokenizer.json
  • tokenizer_config.json

Gemma 7B

Gemma 7B is a larger Gemma model with 7B parameters and open weights. The model is more powerful for a variety of text generation tasks, including question answering, summarization, and reasoning. Gemma 7B is only supported on Web.

Download Gemma 7B

The Gemma 7B model comes in one variant:

If you download Gemma 7B from Hugging Face, you must convert the model to a MediaPipe-friendly format. The LLM Inference API requires the following files to be downloaded and converted:

  • model-00001-of-00004.safetensors
  • model-00002-of-00004.safetensors
  • model-00003-of-00004.safetensors
  • model-00004-of-00004.safetensors
  • tokenizer.json
  • tokenizer_config.json

Falcon 1B

Falcon-1B is a 1 billion parameter causal decoder-only model trained on 350B tokens of RefinedWeb.

Download Falcon 1B

The LLM Inference API requires the following files to be downloaded and stored locally:

  • tokenizer.json
  • tokenizer_config.json
  • pytorch_model.bin

After downloading the Falcon model files, the model is ready to be converted to the MediaPipe format. Follow the steps in Convert model to MediaPipe format.

StableLM 3B

StableLM-3B is a 3 billion parameter decoder-only language model pre-trained on 1 trillion tokens of diverse English and code datasets for 4 epochs.

Download StableLM 3B

The LLM Inference API requires the following files to be downloaded and stored locally:

  • tokenizer.json
  • tokenizer_config.json
  • model.safetensors

After downloading the StableLM model files, the model is ready to be converted to the MediaPipe format. Follow the steps in Convert model to MediaPipe format.

Phi-2

Phi-2 is a 2.7 billion parameter Transformer model. It was trained using various NLP synthetic texts and filtered websites. The model is best suited for prompts using the Question-Answer, chat, and code format.

Download Phi-2

The LLM Inference API requires the following files to be downloaded and stored locally:

  • tokenizer.json
  • tokenizer_config.json
  • model-00001-of-00002.safetensors
  • model-00002-of-00002.safetensors

After downloading the Phi-2 model files, the model is ready to be converted to the MediaPipe format. Follow the steps in Convert model to MediaPipe format.

AI Edge Exported Models

AI Edge is a Google offering that lets you convert user-mapped models into multi-signature TensorFlow Lite models. For more details on mapping and exporting models, visit the AI Edge Torch GitHub page.

After exporting the model into the TFLite format, the model is ready to be converted to the MediaPipe format. For more, see Convert model to MediaPipe format.

Convert model to MediaPipe format

Native model conversion

If you are using an external LLM (Phi-2, Falcon, or StableLM) or a non-Kaggle version of Gemma, use our conversion scripts to format the model to be compatible with MediaPipe.

The model conversion process requires the MediaPipe PyPI package. The conversion script is available in all MediaPipe packages after 0.10.11.

Install and import the dependencies with the following:

$ python3 -m pip install mediapipe

Use the genai.converter library to convert the model:

import mediapipe as mp
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
  input_ckpt=INPUT_CKPT,
  ckpt_format=CKPT_FORMAT,
  model_type=MODEL_TYPE,
  backend=BACKEND,
  output_dir=OUTPUT_DIR,
  combine_file_only=False,
  vocab_model_file=VOCAB_MODEL_FILE,
  output_tflite_file=OUTPUT_TFLITE_FILE,
)

converter.convert_checkpoint(config)

To convert the LoRA model, the ConversionConfig should specify the base model options as well as additional LoRA options. Notice that since the API only supports LoRA inference with GPU, the backend must be set to 'gpu'.

import mediapipe as mp
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
  # Other params related to base model
  ...
  # Must use gpu backend for LoRA conversion
  backend='gpu',
  # LoRA related params
  lora_ckpt=LORA_CKPT,
  lora_rank=LORA_RANK,
  lora_output_tflite_file=LORA_OUTPUT_TFLITE_FILE,
)

converter.convert_checkpoint(config)

The converter will output two TFLite flatbuffer files, one for the base model and the other for the LoRA model.

Parameter Description Accepted Values
input_ckpt The path to the model.safetensors or pytorch.bin file. Note that sometimes the model safetensors format are sharded into multiple files, e.g. model-00001-of-00003.safetensors, model-00001-of-00003.safetensors. You can specify a file pattern, like model*.safetensors. PATH
ckpt_format The model file format. {"safetensors", "pytorch"}
model_type The LLM being converted. {"PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B"}
backend The processor (delegate) used to run the model. {"cpu", "gpu"}
output_dir The path to the output directory that hosts the per-layer weight files. PATH
output_tflite_file The path to the output file. For example, "model_cpu.bin" or "model_gpu.bin". This file is only compatible with the LLM Inference API, and cannot be used as a general `tflite` file. PATH
vocab_model_file The path to the directory that stores the tokenizer.json and tokenizer_config.json files. For Gemma, point to the single tokenizer.model file. PATH
lora_ckpt The path to the LoRA ckpt of safetensors file that stores the LoRA adapter weight. PATH
lora_rank An integer representing the rank of LoRA ckpt. Required in order to convert the lora weights. If not provided, then the converter assumes there are no LoRA weights. Note: Only the GPU backend supports LoRA. Integer
lora_output_tflite_file Output tflite filename for the LoRA weights. PATH

AI Edge model conversion

If you are using an LLM mapped to a TFLite model through AI Edge, use our bundling script to create a Task Bundle. The bundling process packs the mapped model with additional metadata (e.g., Tokenizer Parameters) needed to run end-to-end inference.

The model bundling process requires the MediaPipe PyPI package. The conversion script is available in all MediaPipe packages after 0.10.14.

Install and import the dependencies with the following:

$ python3 -m pip install mediapipe

Use the genai.bundler library to bundle the model:

import mediapipe as mp
from mediapipe.tasks.python.genai import bundler

config = bundler.BundleConfig(
    tflite_model=TFLITE_MODEL,
    tokenizer_model=TOKENIZER_MODEL,
    start_token=START_TOKEN,
    stop_tokens=STOP_TOKENS,
    output_filename=OUTPUT_FILENAME,
    enable_bytes_to_unicode_mapping=ENABLE_BYTES_TO_UNICODE_MAPPING,
)
bundler.create_bundle(config)
Parameter Description Accepted Values
tflite_model The path to the AI Edge exported TFLite model. PATH
tokenizer_model The path to the SentencePiece tokenizer model. PATH
start_token Model specific start token. The start token must be present in the provided tokenizer model. STRING
stop_tokens Model specific stop tokens. The stop tokens must be present in the provided tokenizer model. LIST[STRING]
output_filename The name of the output task bundle file. PATH

LoRA customization

Mediapipe LLM inference API can be configured to support Low-Rank Adaptation (LoRA) for large language models. Utilizing fine-tuned LoRA models, developers can customize the behavior of LLMs through a cost-effective training process.

LoRA support of the LLM Inference API works for Gemma-2B and Phi-2 models for the GPU backend, with LoRA weights applicable to attention layers only. This initial implementation serves as an experimental API for future developments with plans to support more models and various types of layers in the coming updates.

Prepare LoRA models

Follow the instructions on HuggingFace to train a fine tuned LoRA model on your own dataset with supported model types, Gemma-2B or Phi-2. Gemma-2B and Phi-2 models are both available on HuggingFace in the safetensors format. Since LLM Inference API only supports LoRA on attention layers, only specify attention layers while creating the LoraConfig as following:

# For Gemma-2B
from peft import LoraConfig
config = LoraConfig(
    r=LORA_RANK,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
)

# For Phi-2
config = LoraConfig(
    r=LORA_RANK,
    target_modules=["q_proj", "v_proj", "k_proj", "dense"],
)

For testing, there are publicly accessible fine-tuned LoRA models which fit LLM Inference API available on HuggingFace. For example, monsterapi/gemma-2b-lora-maths-orca-200k for Gemma-2B and lole25/phi-2-sft-ultrachat-lora for Phi-2.

After training on the prepared dataset and saving the model, you obtain an adapter_model.safetensors file containing the fine-tuned LoRA model weights. The safetensors file is the LoRA checkpoint used in the model conversion.

As the next step, you need convert the model weights into a TensorFlow Lite Flatbuffer using the MediaPipe Python Package. The ConversionConfig should specify the base model options as well as additional LoRA options. Notice that since the API only supports LoRA inference with GPU, the backend must be set to 'gpu'.

import mediapipe as mp
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
  # Other params related to base model
  ...
  # Must use gpu backend for LoRA conversion
  backend='gpu',
  # LoRA related params
  lora_ckpt=LORA_CKPT,
  lora_rank=LORA_RANK,
  lora_output_tflite_file=LORA_OUTPUT_TFLITE_FILE,
)

converter.convert_checkpoint(config)

The converter will output two TFLite flatbuffer files, one for the base model and the other for the LoRA model.

LoRA model inference

The Web, Android and iOS LLM Inference API are updated to support LoRA model inference. Web supports dynamic LoRA, which can switch different LoRA models during runtime. Android and iOS support static LoRA, which uses the same LoRA weights during the lifetime of the task.

Android supports static LoRA during initialization. To load a LoRA model, users specify the LoRA model path as well as the base LLM.

// Set the configuration options for the LLM Inference task
val options = LlmInferenceOptions.builder()
        .setModelPath('<path to base model>')
        .setMaxTokens(1000)
        .setTopK(40)
        .setTemperature(0.8)
        .setRandomSeed(101)
        .setLoraPath('<path to LoRA model>')
        .build()

// Create an instance of the LLM Inference task
llmInference = LlmInference.createFromOptions(context, options)

To run LLM inference with LoRA, use the same generateResponse() or generateResponseAsync() methods as the base model.