
Optimize inference speed with prefix caching

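Prefix caching is a feature that reduces inference time by storing and reusing the intermediate LLM state produced when processing the shared, repeated prefix portion of a prompt. To enable prefix caching, you only need to separate the static prefix from the dynamic suffix in your API request.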
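To make the idea concrete, the following is a minimal toy sketch of why reusing prefix state saves work. It does not call any real model or the Prompt API: the class name PrefixCacheSketch, the word-based token count, and the integer stand-in for the cached LLM state are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of prefix caching; it does not call any real LLM or ML Kit API. */
public class PrefixCacheSketch {
    // Maps a prefix string to a stand-in for the model state computed from it.
    private final Map<String, Integer> stateByPrefix = new HashMap<>();
    private int tokensProcessed = 0;

    /** Counts whitespace-separated words as a crude stand-in for tokens. */
    public static int countTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Simulates one request. Returns true on a cache hit, false on a miss. */
    public boolean run(String prefix, String suffix) {
        boolean hit = stateByPrefix.containsKey(prefix);
        if (!hit) {
            // Miss: the prefix is processed once and its state is stored.
            int prefixTokens = countTokens(prefix);
            tokensProcessed += prefixTokens;
            stateByPrefix.put(prefix, prefixTokens);
        }
        // The dynamic suffix is processed on every request.
        tokensProcessed += countTokens(suffix);
        return hit;
    }

    public int getTokensProcessed() {
        return tokensProcessed;
    }

    public static void main(String[] args) {
        PrefixCacheSketch sketch = new PrefixCacheSketch();
        String prefix = "Reverse the given sentence: ";
        System.out.println(sketch.run(prefix, "Hello World"));  // prints false (miss)
        System.out.println(sketch.run(prefix, "Good morning")); // prints true (hit)
        System.out.println(sketch.getTokensProcessed());        // prints 8
    }
}
```

In this toy run, the 4-word prefix is processed only on the first request, so two requests cost 8 units of work instead of 12. The real feature caches internal model state rather than a counter, but the saving follows the same pattern.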

Prefix caching currently supports text-only input, so don't use this feature if your prompt includes images.

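There are two ways to implement prefix caching: implicit and explicit.

Implicit prefix caching is a lightweight approach in which the app only needs to define the shared part of the prompt.
Explicit prefix caching gives the app more control over caches, including cache creation, querying, and deletion.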

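Use prefix caching implicitly

To enable prefix caching, add the shared part of the prompt to the promptPrefix field, as shown in the following code snippets: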

Kotlin

val promptPrefix = "Reverse the given sentence: "
val dynamicSuffix = "Hello World"

val result = generativeModel.generateContent(
  generateContentRequest(TextPart(dynamicSuffix)) {
    promptPrefix = PromptPrefix(promptPrefix)
  }
)

Java

String promptPrefix = "Reverse the given sentence: ";
String dynamicSuffix = "Hello World";

GenerateContentResponse response = generativeModelFutures.generateContent(
    new GenerateContentRequest.Builder(new TextPart(dynamicSuffix))
    .setPromptPrefix(new PromptPrefix(promptPrefix))
    .build())
    .get();

In the preceding snippets, dynamicSuffix is passed as the main content, and promptPrefix is provided separately.

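Estimated performance gains

Device and prompt | Without prefix cache | With prefix cache hit (a cache miss may occur the first time a prefix is used)
Pixel 9, 300-token fixed prefix and 50-token dynamic suffix | 0.82 s | 0.45 s
Pixel 9, 1,000-token fixed prefix and 100-token dynamic suffix | 2.11 s | 0.5 s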
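As a worked example, the figures above can be turned into speedup ratios. The sketch below is plain arithmetic on the reported latencies. The class and method names are hypothetical, and the results apply only to the Pixel 9 configurations listed.

```java
import java.util.Locale;

/** Arithmetic on the reported latencies; not a benchmark. */
public class SpeedupEstimate {
    /** Ratio of uncached to cached latency; values above 1 mean the cache helps. */
    public static double speedup(double withoutCacheSeconds, double withCacheHitSeconds) {
        return withoutCacheSeconds / withCacheHitSeconds;
    }

    /** Share of the uncached latency removed by a cache hit, as a percentage. */
    public static double percentSaved(double withoutCacheSeconds, double withCacheHitSeconds) {
        return (withoutCacheSeconds - withCacheHitSeconds) / withoutCacheSeconds * 100.0;
    }

    public static void main(String[] args) {
        // 300-token prefix, 50-token suffix: 0.82 s without cache, 0.45 s with a hit.
        System.out.printf(Locale.ROOT, "300-token prefix: %.2fx faster, %.0f%% less time%n",
                speedup(0.82, 0.45), percentSaved(0.82, 0.45));
        // 1,000-token prefix, 100-token suffix: 2.11 s without cache, 0.5 s with a hit.
        System.out.printf(Locale.ROOT, "1,000-token prefix: %.2fx faster, %.0f%% less time%n",
                speedup(2.11, 0.5), percentSaved(2.11, 0.5));
    }
}
```

The ratios come out to roughly 1.8x for the 300-token prefix and roughly 4.2x for the 1,000-token prefix, which suggests that longer shared prefixes benefit more from a cache hit.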

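Storage considerations

With implicit prefix caching, cache files are saved in the client app's private storage, which increases the app's storage usage. The system stores encrypted cache files together with their associated metadata, including the original prefix text. Keep the following storage considerations in mind: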

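The number of caches is managed by a least recently used (LRU) mechanism. When the maximum total number of caches is exceeded, the least recently used caches are deleted automatically.
The size of a prompt cache depends on the length of the prefix.
To clear all caches created by prefix caching, use the generativeModel.clearImplicitCaches() method.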

Note: The clearImplicitCaches() method is experimental and may change in the future.
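For readers unfamiliar with LRU eviction, the following is a minimal, generic sketch built on the standard java.util.LinkedHashMap. It illustrates the eviction policy only. The class name LruCacheSketch and the capacity of 2 are assumptions, and the actual cache limit and on-disk behavior of the implicit prefix cache are not specified here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Generic LRU map sketch; not the storage implementation used by the Prompt API. */
public class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCacheSketch(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; evict once the capacity is exceeded.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> caches = new LruCacheSketch<>(2);
        caches.put("prefixA", "stateA");
        caches.put("prefixB", "stateB");
        caches.get("prefixA");               // prefixA becomes the most recently used
        caches.put("prefixC", "stateC");     // capacity exceeded, prefixB is evicted
        System.out.println(caches.keySet()); // prints [prefixA, prefixC]
    }
}
```

The practical consequence is that a prefix that has not been used for a while may be evicted, so the next request with that prefix can be a cache miss.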

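Use explicit cache management

The Prompt API includes explicit cache management methods that give developers more precise control over how caches are created, searched, used, and removed. These manual operations run independently of the system's automatic cache handling.

This example shows how to initialize explicit cache management and run inference: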

Kotlin

val cacheName = "my_cache"
val promptPrefix = "Reverse the given sentence: "
val dynamicSuffix = "Hello World"

// Create a cache
val cacheRequest = createCachedContextRequest(cacheName, PromptPrefix(promptPrefix))
val cache = generativeModel.caches.create(cacheRequest)

// Run inference with the cache
val response = generativeModel.generateContent(
  generateContentRequest(TextPart(dynamicSuffix)) {
    cachedContextName = cache.name
  }
)

Java

String cacheName = "my_cache";
String promptPrefix = "Reverse the given sentence: ";
String dynamicSuffix = "Hello World";

// Create a cache
CachedContext cache = cachesFutures.create(
  new CreateCachedContextRequest.Builder(cacheName, new PromptPrefix(promptPrefix))
  .build())
  .get();

// Run inference with the cache
GenerateContentResponse response = generativeModelFutures.generateContent(
  new GenerateContentRequest.Builder(new TextPart(dynamicSuffix))
  .setCachedContextName(cache.getName())
  .build())
  .get();

This example shows how to use generativeModel.caches to query, retrieve, and delete explicitly managed caches:

Kotlin

val cacheName = "my_cache"

// Query pre-created caches
for (cache in generativeModel.caches.list()) {
  // Do something with cache
}

// Get specific cache
val cache = generativeModel.caches.get(cacheName)

// Delete a pre-created cache
generativeModel.caches.delete(cacheName)

Java

String cacheName = "my_cache";

// Query pre-created caches
for (PrefixCache cache : cachesFutures.list().get()) {
  // Do something with cache
}

// Get specific cache
PrefixCache cache = cachesFutures.get(cacheName).get();

// Delete a pre-created cache
cachesFutures.delete(cacheName);
