Adversarial Testing for Generative AI

Adversarial testing is a method for systematically evaluating an ML model with the intent of learning how it behaves when provided with malicious or inadvertently harmful input. This guide describes an example adversarial testing workflow for generative AI.

What is adversarial testing?

Testing is a critical part of building robust and safe AI applications. Adversarial testing involves proactively trying to "break" an application by providing it with data most likely to elicit problematic output. Adversarial queries are likely to cause a model to fail in an unsafe manner (i.e., safety policy violations), and might cause errors that are readily apparent to humans, but difficult for machines to recognize.

Queries may be "adversarial" in different ways:

Explicitly adversarial queries may contain policy-violating language or express policy-violating points of view, or may probe or attempt to "trick" the model into saying something unsafe, harmful, or offensive.
Implicitly adversarial queries may seem innocuous but can contain sensitive topics that are contentious, culturally sensitive, or potentially harmful. These might include information on demographics, health, finance, or religion.

Adversarial testing can help teams improve models and products by exposing current failures to guide mitigation pathways, such as fine tuning, model safeguards or filters. Moreover, it can help inform product launch decisions by measuring risks that may be unmitigated, such as the likelihood that the model with output policy-violating content.

As a best practice for responsible AI, this guide provides an example workflow for adversarial testing for generative models and systems.

Adversarial testing example workflow

Adversarial testing follows a workflow that is similar to standard model evaluation.

Identify inputs for testing

Thoughtful inputs can directly influence the efficacy of the testing workflow. Adversarial tests should be sufficiently diverse and representative with respect to the product policies, failure modes, intended use cases, and edge cases. Adversarial tests should also provide wide coverage of the formulation of queries and the topics and contexts for a model's usage.

The following inputs can help define the scope and objectives of an adversarial test:

Product policy and failure modes

Generative AI products should define safety policies that describe product behavior and model outputs that are not allowed. For example, Google's Generative AI Prohibited Use Policy lists user-AI interactions that are restricted in Google products. All of these policy points should have safeguards in place to prevent them, and the full range of these failure modes form the basis for adversarial testing.
Use cases and edge cases

Test data should represent the vast range of ways that users will interact with the product in the real world. Product use cases like summarizing documents, making recommendations, or representing world cultures in images should be reflected in the queries used for adversarial testing. Likewise, test datasets should include use cases that are expected to be less common but are still possible.
Lexical diversity

Test queries should have a range of different lengths (e.g., word count), use a broad range of vocabulary, not contain duplicates, and represent different query formulations (e.g., wh-questions, direct and indirect requests).
Semantic diversity

Test queries should cover a broad range of different topics per policy, including sensitive and identity-based characteristics, across different use cases and global contexts.

Find or create test datasets

Test datasets for adversarial testing are constructed differently from standard model evaluation test sets. In standard model evaluations, you typically design your test datasets to accurately reflect the distribution of data the model will encounter in product. For adversarial tests, you want to select test data that could elicit problematic output from the model to prove the model's behavior on out-of-distribution examples and edge cases relevant to safety policies.

Find datasets

Investigate existing test datasets for coverage of safety policies, failure modes, and use cases, such as the academic benchmarks listed in the Responsible Generative AI Toolkit for diversity requirements. Teams can use existing datasets to establish a baseline of their products' performance, and then do deeper analyses on specific failure modes their products struggle with.

Create datasets

If existing test datasets are insufficient, teams can generate new data to target specific failure modes and use cases. One way to create new datasets is to start by manually creating a small dataset of queries (i.e., dozens of examples per category), and then expand on this "seed" dataset using data synthesis tools, such as BigQuery DataFrames.

Seed datasets should contain examples that are as similar as possible to what the system would encounter in production, and created with the goal of eliciting a policy violation. Highly toxic language is likely to be detected by safety features, so consider creative phrasing and implicitly adversarial inputs.

You may use direct or indirect references to sensitive attributes (e.g., age, gender, race, religion) in your test dataset. Keep in mind that the usage of these terms may vary between cultures. Vary tone, sentence structure, sentence length, word choice, and meaning. Avoid creating noise and duplication with examples where multiple labels (e.g., hate speech versus obscenity) can apply, since these might not be handled properly by evaluation or training systems.

Analyze datasets

Analyze your adversarial test sets to understand their composition in terms of lexical and semantic diversity, coverage across policy violations and use cases, and overall quality in terms of uniqueness, adversariality, and noise.

Generate and annotate model outputs

The next step is to generate model outputs based on the test dataset. Once generated, annotate them to categorize them into failure modes and harms. These outputs and their annotation labels can help provide safety signals and help measure and mitigate harms.

You can use safety classifiers to automatically annotate model outputs (or inputs) for policy violations. Accuracy may be low for signals that try to detect loosely-defined constructs, such as Hate Speech. For those signals, it is critical to use human raters to check and correct classifier-generated labels for which scores are "uncertain."

In addition to automatic annotation, you can also use human raters to annotate a sample of your data. Annotating model outputs as part of adversarial testing necessarily involves looking at troubling and potentially harmful text or images, similar to manual content moderation. Additionally, human raters may annotate the same content differently based on their personal background, knowledge or beliefs. It can be helpful to develop guidelines or templates for raters, keeping in mind that the diversity of your rater pool could influence the annotation results.

Report and mitigate

The final step is to summarize test results in a report. Compute metrics and report results to provide safety rates, visualizations, and examples of problematic failures. These results can guide model improvements and inform model safeguards, such as filters or blocklists. Reports are also important for communication with stakeholders and decision makers.

Additional Resources

Google's AI Red Team: the ethical hackers making AI safer

Red Teaming Language Models with Language Models

ML Commons AI safety working group's AI safety benchmarks

SAIF: Google's Guide to Secure AI