Safety & Fairness Considerations for Generative Models

Generative AI can be a powerful tool in unlocking creativity, increasing productivity, and simplifying everyday tasks. However, as an early-stage technology, it should be used with appropriate precautions. This resource provides a high-level approach to safety and fairness considerations for generative AI products.

Introduction

The rapid development of generative AI has brought features and products to market in relatively short timeframes. Teams launching products with generative AI capabilities should aim to ensure high-quality, safe, fair, and equitable user experiences in accordance with the AI Principles.

A responsible approach to generative applications should provide plans for accomplishing the following:

  • Content policies, potential harms, and risks analysis
  • Responsible generation
  • Harms prevention
  • Evaluation and adversarial testing

Content Policies, Potential Harms, and Risks Analysis

Product teams should first align on the types of content users are not allowed to generate. Google’s Generative AI Prohibited Use Policy includes specific prohibited use cases for covered Google services.

Refer to the official policy for more details on each of these prohibited use cases. For your own product use cases, define what constitutes "good" content, beyond the mere absence of policy-violating ("bad") content, to align with goals for responsible generation. Your team should also clearly define and describe the use cases that would be considered policy violations or "failure modes."

Content policies are just one step in preventing harm to users. It is also important to consider goals and guiding principles for quality, safety, fairness, and inclusion.

Quality

Teams should devise strategies for responding to queries in sensitive verticals, such as medical information, to help provide high-quality user experiences. Responsible strategies include providing multiple points of view, deferring on topics that lack scientific evidence, or providing only factual information with attribution.

Safety

The goal of AI safety measures is to prevent or contain actions that can lead to harm, whether intentional or unintentional. Without appropriate mitigations, generative models may output unsafe content that violates content policies or causes discomfort to users. Consider providing explanations to users when an output is blocked or the model is unable to generate an acceptable output.

Fairness & Inclusion

Ensure diversity within a response and across multiple responses for the same question. For example, a response to a question about famous musicians should not only include names or images of people of the same gender identity or skin tone. Teams should strive to provide content for different communities when requested. Examine training data for diversity and representation across multiple identities, cultures, and demographics. Consider how outputs over multiple queries are representative of diversity in groups, without perpetuating common stereotypes (e.g., responses to "best jobs for women" compared to "best jobs for men" should not contain traditionally stereotyped content, such as "nurse" appearing under "best jobs for women," but "doctor" appearing under "best jobs for men").

Potential Harms & Risks Analysis

The following steps are recommended when building applications with LLMs (from the PaLM API safety guidance):

  • Understanding the safety risks of your application
  • Considering adjustments to mitigate safety risks
  • Performing safety testing appropriate to your use case
  • Soliciting feedback from users and monitoring usage

To read more about this approach, visit the PaLM API documentation.

For a deeper dive, this talk explores guidance for curbing risks and developing safe and responsible LLM-backed applications.

Responsible Generation

Built-in Model Safety

In one example of built-in safety features, the PaLM API includes adjustable safety settings that block content whose probability of being unsafe exceeds a configurable threshold across six categories: derogatory, toxic, sexual, violent, dangerous, and medical. These settings let developers decide what is appropriate for their use cases, but the API also has built-in protections against core harms, such as content that endangers child safety, which are always blocked and cannot be adjusted.
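
As a concrete illustration, the sketch below adjusts these settings through the PaLM API Python SDK (google.generativeai). The model name, enum values, and response fields are assumptions based on the public documentation at the time of writing and may differ across SDK versions, so check the official reference before relying on them.

    # Hedged sketch: adjusting PaLM API safety settings via google.generativeai.
    # Enum and field names are assumptions and may differ across SDK versions.
    import google.generativeai as palm
    from google.generativeai.types import safety_types

    palm.configure(api_key="YOUR_API_KEY")  # placeholder key

    response = palm.generate_text(
        model="models/text-bison-001",
        prompt="Write a short, friendly product announcement.",
        safety_settings=[
            {
                # Block dangerous content even at low probability of being unsafe.
                "category": safety_types.HarmCategory.HARM_CATEGORY_DANGEROUS,
                "threshold": safety_types.HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
            },
            {
                # Be more permissive for medical content in this hypothetical use case.
                "category": safety_types.HarmCategory.HARM_CATEGORY_MEDICAL,
                "threshold": safety_types.HarmBlockThreshold.BLOCK_ONLY_HIGH,
            },
        ],
    )

    if not response.result:
        # The prompt or the output was blocked; explain this to the user
        # rather than failing silently.
        print("No response could be generated for this request.")
    else:
        print(response.result)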

Model Tuning

Fine-tuning a model can teach it how to answer based on an application's requirements. Example prompts and answers are used to teach a model to better support new use cases, address specific types of harm, or apply the response strategies the product requires.

For instance, consider:

  • Tuning model output to better reflect what is acceptable in your application context.
  • Providing an input method that facilitates safer outputs, such as restricting inputs to a dropdown list.
  • Blocking unsafe inputs and filtering output before it is shown to the user (sketched below).

See PaLM API’s Safety guidance for more examples of adjustments to mitigate safety risks.
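
To make the last two adjustments concrete, here is a minimal, model-agnostic sketch of blocking unsafe inputs and filtering outputs before they reach the user. The classify_harm and generate functions and the thresholds are hypothetical placeholders for your own safety classifier and model call.

    # Minimal sketch: block unsafe inputs and filter outputs around a generative model.
    FALLBACK_MESSAGE = "Sorry, I can't help with that request."

    def classify_harm(text: str) -> float:
        """Return a harm probability in [0, 1]. Replace with a real classifier."""
        raise NotImplementedError

    def generate(prompt: str) -> str:
        """Call your generative model. Replace with a real API call."""
        raise NotImplementedError

    def safe_generate(prompt: str, input_threshold: float = 0.5,
                      output_threshold: float = 0.3) -> str:
        # Block clearly unsafe inputs before they ever reach the model.
        if classify_harm(prompt) >= input_threshold:
            return FALLBACK_MESSAGE
        # Filter the model output before it is shown to the user.
        candidate = generate(prompt)
        if classify_harm(candidate) >= output_threshold:
            return FALLBACK_MESSAGE
        return candidate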

Harms Prevention

Additional methods of harm prevention include using trained classifiers to label each prompt with potential harms or adversarial signals. You can also implement safeguards against deliberate misuse, for example by limiting the volume of queries a single user can submit in a given time period, or by protecting against prompt injection.
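
For example, a simple per-user rate limiter can cap the volume of queries submitted within a time window. The sketch below uses a sliding window; the limit and window size are illustrative values, not recommendations.

    # Minimal sketch: per-user sliding-window rate limiting to curb deliberate misuse.
    import time
    from collections import defaultdict, deque

    class RateLimiter:
        def __init__(self, max_requests: int = 30, window_seconds: float = 60.0):
            self.max_requests = max_requests
            self.window_seconds = window_seconds
            self._history = defaultdict(deque)  # user_id -> timestamps of recent requests

        def allow(self, user_id: str) -> bool:
            now = time.monotonic()
            window = self._history[user_id]
            # Drop timestamps that have fallen outside the sliding window.
            while window and now - window[0] > self.window_seconds:
                window.popleft()
            if len(window) >= self.max_requests:
                return False  # over the limit; reject or queue the request
            window.append(now)
            return True

A request is served only if allow(user_id) returns True; rejected requests can be queued, throttled, or answered with a pre-scripted message.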

Similar to input safeguards, guardrails can be placed on outputs. Content moderation guardrails, such as classifiers, can be used to detect policy-violating content. If these signals indicate that an output is harmful, the application can return an error or empty response, provide a pre-scripted output, or rank multiple outputs from the same prompt by safety.
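
The sketch below illustrates one such output guardrail: sampling several candidates for the same prompt, ranking them by a safety score, and falling back to a pre-scripted reply when none pass. The safety_score and generate_candidates functions and the threshold are hypothetical placeholders.

    # Minimal sketch: rank candidate outputs by safety and fall back when none pass.
    PRESCRIPTED_REPLY = "I can't provide a response to that. Try rephrasing your request."

    def safety_score(text: str) -> float:
        """Return a score in [0, 1], higher meaning safer. Replace with a real classifier."""
        raise NotImplementedError

    def generate_candidates(prompt: str, n: int = 4) -> list[str]:
        """Sample n candidate outputs from your model. Replace with a real API call."""
        raise NotImplementedError

    def moderated_response(prompt: str, min_score: float = 0.8) -> str:
        scored = sorted(
            ((safety_score(c), c) for c in generate_candidates(prompt)),
            reverse=True,
        )
        # Return the safest candidate only if it clears the threshold.
        if scored and scored[0][0] >= min_score:
            return scored[0][1]
        return PRESCRIPTED_REPLY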

Evaluation, Metrics, & Testing

Generative AI products should be rigorously evaluated to ensure they align with safety policies and guiding principles prior to launch. To create a baseline for evaluation and measure improvement over time, metrics should be defined for each salient content quality dimension. After metrics are defined, a separate risks analysis can determine the performance targets for launch, taking into account loss patterns, how likely they are to be encountered, and the impact of harms.

Examples of metrics to consider:

Safety benchmarks: Design safety metrics that reflect the ways your application could be unsafe in the context of how it is likely to be used, then test how well your application performs on those metrics using evaluation datasets.

Violation rate: Given a balanced adversarial dataset (across applicable harms and use cases), the proportion of violative outputs, usually measured with inter-rater reliability.

Blank response rate: Given a balanced set of prompts that a product intends to respond to, the proportion of blank responses (i.e., cases where the product is unable to provide a safe output, whether because the input or the output was blocked).

Diversity: Given a set of prompts, the diversity along dimensions of identity attributes represented in outputs.

Fairness (for quality of service): Given a set of prompts that contain counterfactuals of a sensitive attribute, the ability to provide the same quality of service across those counterfactuals.
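
As an illustration, the sketch below computes two of these metrics (violation rate and blank response rate) from a human-annotated evaluation set. The record schema is an assumption, not a prescribed format.

    # Minimal sketch: compute violation rate and blank response rate from annotations.
    from dataclasses import dataclass

    @dataclass
    class EvalRecord:
        prompt: str
        response: str
        is_violative: bool   # rater judgment: output violates content policy
        is_blank: bool       # no usable output (input or output was blocked)

    def violation_rate(records: list[EvalRecord]) -> float:
        """Share of adversarial prompts that produced a policy-violating output."""
        return sum(r.is_violative for r in records) / len(records)

    def blank_response_rate(records: list[EvalRecord]) -> float:
        """Share of prompts the product intended to answer but left blank."""
        return sum(r.is_blank for r in records) / len(records)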

Adversarial Testing

Adversarial testing involves proactively trying to "break" your application. The goal is to identify points of weakness so that you can take steps to remedy them.

Adversarial testing is a method for systematically evaluating an ML model with the intent of learning how it behaves when provided with malicious or inadvertently harmful input:

  • An input is malicious when the input is clearly designed to produce an unsafe or harmful output – for example, asking a text generation model to generate a hateful rant about a particular religion.
  • An input is inadvertently harmful when the input itself may be innocuous, but produces harmful output – for example, asking a text generation model to describe a person of a particular ethnicity and receiving a racist output.

Adversarial testing has two primary objectives: to help teams systematically improve models and products by exposing current failure patterns and guiding mitigation pathways, and to inform product decisions by assessing alignment with safety product policies and by measuring risks that may not be fully mitigated.

Adversarial testing follows a workflow that is similar to standard model evaluation:

  1. Find or create a test dataset
  2. Run model inference using the test dataset
  3. Annotate model output
  4. Analyze and report results
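
A minimal sketch of this workflow, assuming a CSV test dataset with a prompt column and hypothetical generate and annotate functions standing in for your model call and for rater or classifier annotations:

    # Minimal sketch of the four-step adversarial testing workflow.
    import csv

    def generate(prompt: str) -> str:
        """Call your generative model. Replace with a real API call."""
        raise NotImplementedError

    def annotate(output: str) -> str:
        """Label an output against your content policies (raters or a classifier)."""
        raise NotImplementedError

    def run_adversarial_test(dataset_path: str, report_path: str) -> None:
        # 1. Find or create a test dataset (here, a CSV with a `prompt` column).
        with open(dataset_path, newline="") as f:
            prompts = [row["prompt"] for row in csv.DictReader(f)]

        # 2. Run model inference using the test dataset.
        outputs = [generate(p) for p in prompts]

        # 3. Annotate model output, e.g. "safe" or "violative".
        labels = [annotate(o) for o in outputs]

        # 4. Analyze and report results.
        with open(report_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["prompt", "output", "label"])
            writer.writerows(zip(prompts, outputs, labels))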

What distinguishes an adversarial test from a standard evaluation is the composition of the data used for testing. For adversarial tests, select test data that is most likely to elicit problematic output from the model. This means probing the model’s behavior for all the types of harm that are possible, including rare or unusual examples and edge cases relevant to your safety policies. The data should also include diversity across different dimensions of the input, such as sentence structure, meaning, and length.