Protected App Signals to support relevant app install ads

This proposal is subject to the Privacy Sandbox enrollment process and attestations. For further information regarding the attestations, please refer to the attestation link provided. Future updates to this proposal will describe the requirements for gaining access to this system.

Mobile app install ads, also known as user acquisition ads, are a type of mobile advertising designed to encourage users to download a mobile app. These ads are typically served to users based on their interests and demographics, and they often appear in other mobile apps such as games, social media, and news apps. When a user clicks on an app install ad, they are taken directly to the app store to download the app.

For example, an advertiser who is trying to drive new installs for a new mobile food delivery app in the United States might target their ads to users whose location is in the US and who have previously engaged with other food delivery apps.

This is typically implemented by sharing contextual, first-party, and third-party signals between ad techs to create user profiles based on Advertising IDs. Ad tech machine learning models use this information as inputs to choose ads that are relevant to the user and have the highest probability of resulting in a conversion.

The following APIs are proposed to support effective app install ads that improve user privacy by removing reliance on cross-party user identifiers:

  1. Protected App Signals API: This is centered around the storage and creation of ad tech engineered features which represent a user's potential interests. Ad techs store self-defined per-app event signals, such as app installs, first opens, user actions (in-game leveling, achievements), purchase activities, or time in app. Signals are written and stored on device to safeguard against data leakage, and are only made available to the ad tech logic that stored the given signal during a Protected Auction running in a secure environment.
  2. Ad Selection API: This provides an API to configure and execute a Protected Auction running in a Trusted Execution Environment (TEE) where ad techs retrieve ad candidates, run inference, compute bids, and do scoring to choose a "winning" ad using both the Protected App Signals and publisher-provided real-time contextual information.

Figure: Flowchart that shows the Protected App Signals and ad selection workflow in the Privacy Sandbox on Android.

Here is a high-level overview of how Protected App Signals work to support relevant app install ads. The following sections of this document provide more detail on each of these steps.

  • Signal curation: As users use mobile apps, ad techs curate signals by storing ad tech defined app events for serving relevant ads using the Protected App Signals API. These events are stored in protected on-device storage, similar to Custom Audiences, and are encrypted before being sent off the device such that only Bidding and Auction services running within Trusted Execution Environments with the appropriate security and privacy control can decrypt them for bidding and scoring ads.
  • Signal Encoding: Signals are prepared on a scheduled cadence by custom ad tech logic. An Android background job executes this logic to perform on-device encoding to generate a payload of Protected App Signals that can later be used in real-time for ad selection during a Protected Auction. The payload is securely stored on the device until sent for an auction.
  • Ad Selection: To select relevant ads for the user, seller SDKs send an encrypted payload of Protected App Signals and configure a Protected Auction. In the auction, buyer custom logic prepares the Protected App Signals along with the publisher-provided contextual data (data typically shared in an Open-RTB ad request) to engineer features intended for ad selection (ad retrieval, inference, and bid generation). Similar to Protected Audience, buyers submit ads to the seller for final scoring in a Protected Auction.
    • Ad Retrieval: Buyers use Protected App Signals and publisher-provided contextual data to engineer features relevant to the user's interests. These features are used to match ads that meet targeting criteria. Ads not within budget are filtered out. The top k ads are then selected for bidding.
    • Bidding: Buyers' custom bidding logic prepares the publisher-provided contextual data and Protected App Signals to engineer features that are used as input to buyer machine learning models for inference and bidding on candidate ads within trusted privacy-preserving boundaries. The buyer will then return their chosen ad to the seller.
    • Seller Scoring: Sellers' custom scoring logic scores ads submitted by participating Buyers and chooses a winning ad to be sent back to the app for rendering.
  • Reporting: Auction participants receive applicable win reports and loss reports. We are exploring privacy-preserving mechanisms for including data for model training in the win report.

Timeline

The timeline covers the Developer Preview (Q4'23 and Q1'24) and the Beta (Q2'24 and Q3'24).

  • Signal Curation APIs
    • Q4'23: On-device storage APIs
    • Q1'24: On-device storage quota logic; on-device custom logic daily updates
    • Q2'24: N/A
    • Q3'24: Available for 1% T+ Devices
  • Ad retrieval server in a TEE
    • Q4'23: MVP
    • Q1'24: Available on GCP; support for Top K; UDF productionisation
    • Q2'24: Available on AWS; Consented Debugging, Metrics, and Monitoring
  • Inference Service in a TEE (support for running ML models and using them for bidding in a TEE)
    • Q4'23: In development
    • Q1'24: Available on GCP; ability to deploy and prototype static ML models using TensorFlow and PyTorch
    • Q2'24: Available on AWS; productionized model deployment for TensorFlow and PyTorch models; Telemetry, Consented Debugging, and Monitoring
  • Bidding and Auction support in a TEE
    • Q4'23: Available on GCP
    • Q1'24: PAS-B&A and TEE Ad Retrieval integration (with gRPC and TEE<>TEE encryption); Ad Retrieval support through the contextual path (includes B&A<>K/V support on TEE)
    • Q2'24: Available on AWS; debug reporting; Consented Debugging, Metrics, and Monitoring

Curate Protected App Signals

A signal is a representation of various user interactions in an app that are determined by ad tech to be useful for serving relevant ads. An app or the integrated SDK may store or delete Protected App Signals defined by ad techs based on user activity, such as app opens, achievements, purchase activity, or time in app. Protected App Signals are stored securely on-device, and are encrypted before being sent off the device such that only Bidding and Auction services running within Trusted Execution Environments with appropriate security and privacy control can decrypt them for bidding and scoring ads. Similar to the Custom Audience API, the signals stored on a device cannot be read or inspected by apps or SDKs; there is no API for reading signal values, and APIs are designed to avoid leaking the presence of signals. Ad tech custom logic has protected access to its curated signals to engineer features that serve as the basis for ad selection in a Protected Auction.

Protected App Signals API

The Protected App Signals API supports the management of signals using a delegation mechanism similar to the one used for custom audiences. The API enables signal storage in the form of a single scalar value or as a time series. Time-series signals can store things like user session durations, and they can enforce a maximum length by using a first in, first out eviction rule. The data type of a scalar signal, or of each element of a time-series signal, is a byte array. Each value is enriched with the package name of the application that stored the signal and the creation timestamp of the store signal API call. This extra information is available in the signal encoding JavaScript. This example shows the structure of the signals owned by a given ad tech:

This example demonstrates a scalar signal and a time series signal associated with adtech1.com:

  • A scalar signal with Base64 key "A1c" and value "c12Z". The signal store was triggered by com.google.android.game_app on 1 June 2023.
  • A list of signals with key "dDE" that have been created by two different applications.
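
As a rough illustration of the two bullets above, the stored signals could be pictured as follows. This is only a sketch: the field names (value, packageName, creationTime), the second package name, and the second time-series value are hypothetical placeholders, not the platform's storage schema.

```javascript
// Hypothetical in-memory view of the signals owned by adtech1.com.
// Field names and the com.example.other_app entry are illustrative assumptions.
const signalsForAdtech1 = new Map([
  // Scalar signal: a single entry stored by the game app on 1 June 2023.
  ["A1c", [
    { value: "c12Z", packageName: "com.google.android.game_app", creationTime: Date.UTC(2023, 5, 1) },
  ]],
  // Time-series signal: entries written by two different applications.
  ["dDE", [
    { value: "d23d", packageName: "com.google.android.game_app", creationTime: Date.UTC(2023, 5, 1) },
    { value: "e45f", packageName: "com.example.other_app", creationTime: Date.UTC(2023, 5, 2) },
  ]],
]);

// List the distinct applications that contributed values to a key.
function appsForKey(signals, key) {
  return [...new Set((signals.get(key) ?? []).map((s) => s.packageName))];
}
```

In this sketch a scalar is simply a one-element list, which keeps the scalar and time-series cases uniform for the encoding logic described later.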

Ad techs are allotted a certain amount of space to store signals on the device. Signals will have a max TTL, which is to be determined.

Protected App Signals are removed from the storage if the generating application is uninstalled, blocked from using Protected App Signals API, or if the app data is cleared by the user.

The Protected App Signals API is composed of the following parts:

  • a Java and JavaScript API to add, update or remove signals.
  • a JavaScript API to process the persisted signals to prepare them for further feature engineering in real time during a Protected Auction running in a Trusted Execution Environment (TEE).

Add, update or remove signals

Ad techs can add, update or remove signals with the fetchSignalUpdates() API. This API supports delegation similar to Protected Audience custom audience delegation.

To add a signal, buyer ad techs that don't have an SDK presence in apps need to collaborate with ad techs that have an on-device presence, such as mobile measurement partners (MMPs) and supply-side platforms (SSPs). The Protected App Signals API aims to support these ad techs by enabling on-device callers to invoke Protected App Signal creation on behalf of buyers. This process is called delegation and leverages the fetchSignalUpdates() API. fetchSignalUpdates() takes a URI and issues a GET request to it to retrieve the list of updates to apply to the local signal storage. The endpoint, owned by the buyer, responds with a JSON list of commands.

The supported JSON commands are:

  • put: inserts or overrides a scalar value for the given key.
  • put_if_not_present: inserts a scalar value for the given key if there is no value already stored. This option could be useful for example to set an experiment ID for the given user and avoid overriding it if it was already set by a different application.
  • append: adds an element to the time series associated with the given key. The maxSignals parameter specifies the max number of signals in the time series. If the size is exceeded the earlier elements are removed. If the key contains a scalar value it is automatically transformed into a time series.
  • remove: removes the content associated with the given key.

{
  "put": {
    "A1c": "c12Z",
    "dDE": "d23d"
  },
  "put_if_not_present": {
    "aA9": "a1zfC3"
  },
  "append": {
    "bB1": {"values": ["gh12D", "d45g"], "maxSignals": 20}
  },
  "remove": ["c0D"]
}

All keys and values are expressed in Base64.
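
The semantics of the four commands can be sketched with a small simulation of the local signal store. This is not the platform's implementation: the store layout (a Map from key to a list of entries), the entry field names, and the order in which the commands are applied are assumptions made for illustration.

```javascript
// Apply one update response to a local store (Map of key -> array of entries).
// A scalar is modelled as a one-element array; this is an assumption of the sketch.
function applySignalUpdates(store, updates, packageName, now = Date.now()) {
  const entry = (value) => ({ value, packageName, creationTime: now });

  for (const [key, value] of Object.entries(updates.put ?? {})) {
    store.set(key, [entry(value)]); // insert or override
  }
  for (const [key, value] of Object.entries(updates.put_if_not_present ?? {})) {
    if (!store.has(key)) store.set(key, [entry(value)]); // keep any existing value
  }
  for (const [key, { values, maxSignals }] of Object.entries(updates.append ?? {})) {
    const series = [...(store.get(key) ?? []), ...values.map(entry)];
    // First in, first out: keep only the most recent maxSignals elements.
    store.set(key, series.slice(Math.max(0, series.length - maxSignals)));
  }
  for (const key of updates.remove ?? []) {
    store.delete(key);
  }
  return store;
}

// Usage: "aA9" already exists, so put_if_not_present leaves it untouched.
const signalStore = new Map([
  ["aA9", [{ value: "old1", packageName: "com.example.app", creationTime: 0 }]],
]);
applySignalUpdates(signalStore, {
  put: { A1c: "c12Z", dDE: "d23d" },
  put_if_not_present: { aA9: "a1zfC3" },
  append: { bB1: { values: ["gh12D", "d45g"], maxSignals: 20 } },
  remove: ["c0D"],
}, "com.example.app");
```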

The commands listed above are intended to provide insert, overwrite, and delete semantics for scalar signals, and insert, append, and full series overwrite semantics for time-series signals. Delete and overwrite semantics on specific elements of a time-series signal must be managed during the encoding and compaction process. For example, encoding logic can ignore values in a time series that are superseded or corrected by more recent ones, and the compaction process can delete them.

Stored signals are automatically associated with the application performing the fetch request and the responder of the request (the "site" or "origin" of an enrolled ad tech), plus the creation time of the request. All signals are stored on behalf of a Privacy Sandbox enrolled ad tech, so the URI "site" or "origin" needs to match the data of an enrolled ad tech. If the requesting ad tech is not enrolled, the request is rejected.

Storage quota and eviction

Each ad tech has a limited amount of space on the user device to store signals. This quota is defined per ad tech, so signals curated from different apps share quota. If the quota is exceeded, the system clears up space by removing earlier signal values on a first in, first out basis. To prevent eviction from being executed too frequently, the system implements a batching logic to allow for a limited amount of quota overdraft and to clear up some extra space once the eviction logic is triggered.
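
The batching behavior described above can be pictured with a toy model. The class name, the byte-based accounting, and the three parameters are hypothetical; the actual quota sizes and eviction thresholds are not specified in this proposal.

```javascript
// Hypothetical quota model: sizes in bytes, entries ordered oldest first.
class SignalQuota {
  constructor(quotaBytes, overdraftBytes, extraFreeBytes) {
    this.quotaBytes = quotaBytes;         // nominal per-ad-tech quota
    this.overdraftBytes = overdraftBytes; // tolerated overshoot before evicting
    this.extraFreeBytes = extraFreeBytes; // extra headroom cleared once eviction runs
    this.entries = [];
    this.usedBytes = 0;
  }

  add(key, sizeBytes) {
    this.entries.push({ key, sizeBytes });
    this.usedBytes += sizeBytes;
    // Eviction only triggers once the overdraft allowance is exhausted.
    if (this.usedBytes > this.quotaBytes + this.overdraftBytes) this.evict();
  }

  evict() {
    // Clear below the quota so the next few writes do not trigger eviction again.
    const target = this.quotaBytes - this.extraFreeBytes;
    while (this.entries.length > 0 && this.usedBytes > target) {
      this.usedBytes -= this.entries.shift().sizeBytes; // drop earliest values first
    }
  }
}
```

With a quota of 100 bytes, an overdraft of 20, and 30 bytes of extra headroom, four 25-byte signals fit without eviction, and a fifth triggers a single eviction pass that removes the three earliest signals.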

On-device encoding for data transmission

To prepare signals for ad selection, per-buyer custom logic has protected access to the stored per-app signals and events. An Android system background job runs hourly to execute per-buyer custom encoding logic that is downloaded to the device. The per-buyer custom encoding logic encodes the per-app signals, and then compresses them into a payload that complies with the per-buyer quota. The payload is then encrypted within the boundaries of protected device storage, and then transmitted to Bidding and Auction services.

Ad techs define the level of signal processing handled by their own custom logic. For example, you could instrument your solution to discard earlier signals, and aggregate similar or reinforcing signals from different applications into new signals that use less space.
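
As one possible shape for such custom logic, the sketch below keeps only the newest value per key and discards the earliest signals until the serialized payload fits a size limit. The function name, its parameters, the input layout (a Map from key to entries with value and creationTime fields), and the JSON wire format are all assumptions for illustration; the proposal leaves the encoder interface to a future update.

```javascript
// Hypothetical encoder: keeps the newest entry per key and drops the earliest
// keys until the serialized payload fits within maxSizeBytes.
function encodeSignals(signals, maxSizeBytes) {
  const newest = [];
  for (const [key, entries] of signals) {
    if (entries.length === 0) continue;
    const latest = entries.reduce((a, b) => (b.creationTime > a.creationTime ? b : a));
    newest.push([key, latest.value, latest.creationTime]);
  }
  newest.sort((a, b) => b[2] - a[2]); // most recent first

  const encoder = new TextEncoder();
  while (newest.length > 0) {
    const bytes = encoder.encode(JSON.stringify(newest.map(([k, v]) => [k, v])));
    if (bytes.length <= maxSizeBytes) return bytes;
    newest.pop(); // discard the earliest remaining signal
  }
  return new Uint8Array(0); // nothing fits within the limit
}
```

An ad tech could replace the "newest value per key" rule with aggregation across applications; the point of the sketch is only that the encoder, not the platform, decides what survives compaction.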

If a buyer has not registered a signal encoder, then signals are not prepared, and none of the on-device curated signals are sent to Bidding and Auction services.

More details on storage, payload, and request quotas will be available in a future update. In addition, we will provide further information on how to provide custom functions.

Ad selection workflow

With this proposal, ad tech custom code can only access the Protected App Signals within a Protected Auction (Ad Selection API) running in a TEE. To further support the needs for the app install use case, candidate ads are fetched during the Protected Auction in real-time. This contrasts with the remarketing use case where candidate ads are known before the auction.

This proposal uses an ad selection workflow similar to the one in the Protected Audience proposal, with updates to support the app install use case. To support the computing requirements for feature engineering and real-time ad selection, auctions for app install ads are required to run on Bidding and Auction services running in TEEs. Access to Protected App Signals during a Protected Auction is not supported with on-device auctions.

Figure: The ad selection workflow in the Privacy Sandbox on Android.

The ad selection workflow is as follows:

  1. The seller's SDK sends the on-device encrypted payload of Protected App Signals.
  2. The seller's server creates an auction configuration and sends it to the seller's Trusted Bidding and Auction service, along with the encrypted payload, to initiate an ad selection workflow.
  3. The seller's Bidding and Auction service passes the payload to the participating trusted buyers' frontend servers.
  4. The buyer's bidding service executes buy-side ad selection logic:
    1. Buy-side ad retrieval logic execution.
    2. Buy-side bidding logic execution.
  5. The sell-side scoring logic is executed.
  6. The ad is rendered and reporting is initiated.

Initiate ad selection workflow

When an application is ready to show an ad, the ad tech SDK (typically an SSP's) initiates the ad selection workflow using the getAdSelectionData call. The SDK sends any relevant contextual data from the publisher, along with the per-buyer encrypted payloads, in the request to the Protected Auction. This is the same API used for the remarketing workflow and described in the Bidding And Auction Integration for Android proposal.

To initiate ad selection, the seller passes in a list of participating buyers and the encrypted payload of the on-device Protected App Signals. With this information, the sell-side ad server prepares a SelectAdRequest for its trusted SellerFrontEnd service.

The seller sets the following:

  • The payload received from the getAdSelectionData, which contains the Protected App Signals.
  • The contextual signals using:
    • auction_signals (for information about the auction)
    • buyer_signals (for buyers' signals fields)
  • The list of buyers included in the auctions using the buyer_list field.

Buy-side ad selection logic execution

At a high level, the buyer's custom logic uses contextual data provided by the publisher and Protected App Signals to select and apply a bid to relevant ads for the ad request. The platform enables buyers to narrow down a large pool of available ads to the most relevant ones (the top k), for which bids are computed before the ads are returned to the seller for final selection.

Figure: Buy-side ad selection execution logic in the Privacy Sandbox on Android.

Before bidding, buyers start with a large pool of ads. It is too slow to calculate a bid for each ad, so buyers first need to select the top k candidates from the large pool. Next, buyers need to calculate bids for each of those top k candidates. Then, those ads and bids are returned to the seller for the final selection.

  1. The BuyerFrontEnd service receives an ad request.
  2. The BuyerFrontEnd service sends a request to the buyer's bidding service. The buyer's bidding service runs a UDF called prepareDataForAdRetrieval(), which builds a request to get the top k candidates from the Ad Retrieval Service. The bidding service sends this request to the configured retrieval server endpoint.
  3. The Ad Retrieval Service runs the getCandidateAds() UDF, which filters down to the set of top k candidate ads, which are sent to the buyer's bidding service.
  4. The buyer's bidding service runs the generateBid() UDF, which picks the best candidate, calculates its bid, then returns it to the BuyerFrontEnd service.
  5. The BuyerFrontEnd service returns the ads and bids to the seller.
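
The five steps above can be sketched as a simple pipeline. The three UDF bodies here are stubs with made-up parameters and fields (k, relevance, expectedValue); the real UDFs are buyer-supplied and run on separate services inside TEEs, so this only illustrates the order of calls and how top k filtering limits which ads receive a bid.

```javascript
// Stub UDFs; the real signatures are not specified here.
function prepareDataForAdRetrieval(adRequest) {
  return { k: 2, signals: adRequest.signals };
}

function getCandidateAds(retrievalRequest, adPool) {
  // Cheap relevance ordering, then keep only the top k.
  return [...adPool].sort((a, b) => b.relevance - a.relevance).slice(0, retrievalRequest.k);
}

function generateBid(candidates) {
  if (candidates.length === 0) return null;
  const best = candidates.reduce((a, b) => (b.expectedValue > a.expectedValue ? b : a));
  return { renderUrl: best.renderUrl, bid: best.expectedValue };
}

// BuyerFrontEnd -> bidding service -> Ad Retrieval Service -> bidding service -> seller.
function runBuySideSelection(adRequest, adPool) {
  const retrievalRequest = prepareDataForAdRetrieval(adRequest); // step 2
  const topK = getCandidateAds(retrievalRequest, adPool);         // step 3
  return generateBid(topK);                                       // steps 4 and 5
}
```

Note that an ad with a high expected value but low retrieval relevance never reaches generateBid() in this sketch, which is the trade-off of computing bids only for the top k candidates.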

There are several important details about this flow, especially with regard to how the pieces talk to each other, and how the platform provides features like the ability to make machine learning predictions for retrieving the top k ads and calculating their bids.

Before we look at parts of this in more detail, there are some important architectural notes about the TEEs in the diagram above.

The buyer's bidding service internally contains an inference service. Ad techs can upload machine learning models to the buyer's bidding service. We will provide JavaScript APIs for ad techs to make predictions or generate embeddings from these models from within the UDFs running on the buyer's bidding service. Unlike the Ad Retrieval Service, the buyer's bidding service does not have a key value service to store any ads metadata.

The Ad Retrieval Service internally includes a key-value service. Ad techs can materialize key-value pairs into this service from their own servers, outside the privacy boundary. We will provide a JavaScript API for ad techs to read from this key-value service from within the UDFs running on the Ad Retrieval Service. Unlike the buyer's bidding service, the Ad Retrieval Service does not contain an inference service.

One central question this design addresses is how to make retrieval-time and bidding-time predictions. The answer for both can involve a solution called model factorization.

Model factorization

Model factorization is a technique that makes it possible to break a single model into multiple pieces, and then combine those pieces into a prediction. In the App Install use case, models often make use of three kinds of data: user data, contextual data, and ad data.

In the non-factorized case, a single model is trained on all three kinds of data. In the factorized case, we break the model up into multiple pieces. Only the piece containing user data is sensitive. That means only the model with the user piece needs to be run inside the trust boundary, on the buyer's bidding service's inference service.

That makes the following design possible:

  1. Break the model up into a private piece (the user data) and one or more non-private pieces (the contextual and ad data).
  2. Optionally, pass some or all of the non-private pieces as arguments to a UDF that needs to make predictions. For example, contextual embeddings are passed to UDFs in the per_buyer_signals.
  3. Optionally, ad techs can create models for non-private pieces, then materialize embeddings from those models into the Ad Retrieval Service's key-value store. UDFs on the Ad Retrieval Service can fetch these embeddings at runtime.
  4. To make a prediction during a UDF, combine private embeddings from the inference service with non-private embeddings from UDF function arguments or the key-value store with an operation like a dot product. This is the final prediction.
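
Step 4 can be illustrated with a minimal sketch. The dot product comes from the description above; squashing the result through a sigmoid to obtain a probability is an assumption of this example, as are the function names.

```javascript
// Dot product of two equal-length embeddings.
function dot(a, b) {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Assumed output layer: maps the raw score to a probability in (0, 1).
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// privateEmbedding: computed by the inference service inside the trust boundary.
// nonPrivateEmbedding: passed as a UDF argument or read from the key-value store.
function predict(privateEmbedding, nonPrivateEmbedding) {
  return sigmoid(dot(privateEmbedding, nonPrivateEmbedding));
}
```

Only the user-side tower has to run on the inference service; the ad and contextual towers can be evaluated ahead of time outside the TEE, which keeps the in-TEE model small.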

With that explained, we can look at each UDF in more detail. We'll explain what they do, how they integrate, and how they can make the predictions necessary to choose top k ads and calculate their bids.

The prepareDataForAdRetrieval() UDF

prepareDataForAdRetrieval(), running on the buyer's bidding service, is responsible for creating the request payload that will be sent to the Ad Retrieval Service to fetch the top k candidate ads.

prepareDataForAdRetrieval() takes the following information:

  • The per-buyer payload received from the getAdSelectionData. This payload contains the Protected App Signals.
  • The contextual signals' auction_signals (for info about the auction) and buyer_signals (for buyers' signals fields).

prepareDataForAdRetrieval() does two things:

  • Featurization: if retrieval-time inference is needed, it transforms incoming signals into features for use during calls to the inference service to get private embeddings for retrieval.
  • Calculates private embeddings for retrieval: if retrieval predictions are needed, it makes the call against the inference service using the above features, and gets a private embedding for retrieval-time predictions.

prepareDataForAdRetrieval() returns:

  • Protected App Signals: ad tech-curated signals payload.
  • Auction-specific signals: platform-specific sell-side signals, and contextual information like auction_signals and per_buyer_signals (including contextual embeddings) from SelectAdRequest. This is similar to Protected Audiences.
  • Additional Signals: extra information like private embeddings retrieved from the inference service.

This request is sent to the Ad Retrieval Service, which does candidate matching and then runs the getCandidateAds() UDF.
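
A minimal sketch of this UDF, under several assumptions: the parameter list, the decoded signal fields (installs, purchases), the return field names, and the runInference() stand-in are all hypothetical, since the platform's JavaScript inference API and the exact UDF signature are not given here.

```javascript
// Stand-in for the platform's inference API, which is not specified here.
// Toy "model": normalises the feature vector to unit length.
function runInference(features) {
  const norm = Math.hypot(...features) || 1;
  return features.map((x) => x / norm);
}

function prepareDataForAdRetrieval(protectedAppSignals, auctionSignals, perBuyerSignals) {
  // Featurization: turn decoded signals into a numeric vector.
  const features = [
    protectedAppSignals.installs ?? 0,
    protectedAppSignals.purchases ?? 0,
  ];
  // Private embedding used for retrieval-time predictions.
  const privateEmbedding = runInference(features);

  // The three groups of outputs described above.
  return {
    protectedAppSignals,
    auctionSpecificSignals: { auctionSignals, perBuyerSignals },
    additionalSignals: { privateEmbedding },
  };
}
```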

The getCandidateAds() UDF

getCandidateAds() runs on the Ad Retrieval Service. It receives the request created by prepareDataForAdRetrieval() on the buyer's bidding service. The service executes getCandidateAds(), which fetches the top k candidates for bidding by converting the request into a series of set queries and data fetches, and by executing custom business logic and other custom retrieval logic.

getCandidateAds() takes the following information:

  • Protected App Signals: ad tech-curated signals payload.
  • Auction-specific signals: platform-specific sell-side signals, and contextual information like auction_signals and per_buyer_signals (including contextual embeddings) from SelectAdRequest. This is similar to Protected Audiences.
  • Additional Signals: extra information like private embeddings retrieved from the inference service.

getCandidateAds() does the following:

  1. Fetch an initial set of ad candidates: Fetched using targeting criteria such as language, geo, ad type, ad size, or budget, to filter ad candidates.
  2. Retrieval embedding fetch: If embeddings from the key-value service are needed to make a retrieval-time prediction for top k selection, they must be retrieved from the key-value service.
  3. Top k candidate selection: Compute a lightweight score for the filtered set of ad candidates based on ad metadata fetched from the key-value store and information sent from the buyer's bidding service, then pick the top k candidates based on that score. For example, the score may be the chance of installing an app given the ad.
  4. Bidding embedding fetch: if embeddings from the key-value service are needed by bidding code to make bidding-time predictions, they may be retrieved from the key-value service.

Note that the score for an ad may be the output of a predictive model, which for example predicts the probability of a user installing an app. This kind of score generation involves a kind of model factorization: since getCandidateAds() runs on the Ad Retrieval Service, and since the Ad Retrieval Service does not have an inference service, predictions may be generated by combining:

  • Contextual embeddings passed in using the auction-specific signals input.
  • Private embeddings passed in using the additional signals input.
  • Any non-private embeddings ad techs have materialized from their servers into the Ad Retrieval Service's key-value service.

Note that the generateBid() UDF that runs on the buyer's bidding service may also apply its own kind of model factorization to make its bidding predictions. If any embeddings are needed from a key-value service to do this, they must be fetched now.

getCandidateAds() returns:

  • Candidate ads: top k ads to be passed to generateBid(). Each ad is composed of:
    • Render URL: endpoint for rendering the ad creative.
    • Metadata: buy-side, ad tech-specific ads metadata. For example, this may include information about the ad campaign, and targeting criteria such as location and language. The metadata can include optional embeddings used when model factorization is needed to run inference during bidding.
  • Additional Signals: optionally, the Ad Retrieval Service can include extra information such as additional embeddings or spam signals to be used in generateBid().
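
The filtering, scoring, and top k steps of getCandidateAds() could look roughly like the sketch below. The parameter list, the adStore Map standing in for the key-value service, and every field name on the ads and the request are hypothetical.

```javascript
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// adStore stands in for the key-value service: ad id -> ad metadata.
function getCandidateAds(request, adStore, k) {
  const { language, geo } = request.auctionSpecificSignals;
  const userEmbedding = request.additionalSignals.privateEmbedding;

  return [...adStore.values()]
    // Targeting and budget filter.
    .filter((ad) => ad.language === language && ad.geo === geo && ad.remainingBudget > 0)
    // Lightweight score from a factorized model.
    .map((ad) => ({ ad, score: dot(userEmbedding, ad.retrievalEmbedding) }))
    // Top k selection.
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    // Render URL plus metadata, including an embedding for use at bidding time.
    .map(({ ad }) => ({
      renderUrl: ad.renderUrl,
      metadata: { campaignId: ad.campaignId, biddingEmbedding: ad.biddingEmbedding },
    }));
}
```

Because the Ad Retrieval Service has no inference service, the only model evaluation in this sketch is the dot product between the private embedding sent by the bidding service and embeddings that were materialized into the store ahead of time.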

We are investigating other methods for providing ads to be scored, including making it available as part of the SelectAdRequest call. These ads can be retrieved using an RTB bid request. Note that in such cases, ads must be retrieved without Protected App Signals. We anticipate that ad techs will evaluate tradeoffs before choosing their best option, including response payload size, latency, cost, and availability of signals.

The generateBid() UDF

Once you've retrieved the set of candidate ads and the embeddings during retrieval, you're ready to proceed to bidding, which runs in the buyer's bidding service. This service runs the buyer-supplied generateBid() UDF to select the ad to bid on from the top k, then return it with its bid.

generateBid() takes the following information:

  • Candidate ads: the top k ads returned by the Ad Retrieval Service.
  • Auction-specific signals: platform-specific sell-side signals, and contextual information like auction_signals and per_buyer_signals (including contextual embeddings) from SelectAdRequest.
  • Additional signals: extra information to be used at bidding time.

The buyer's generateBid() implementation does three things:

  • Featurization: transforms signals into features for use during inference.
  • Inference: generates predictions using machine learning models to calculate values like predicted click-through and conversion rates.
  • Bidding: combines inferred values with other inputs to calculate the bid for a candidate ad.

generateBid() returns:

  • The candidate ad.
  • Its calculated bid amount.

Note that the generateBid() used for App Install Ads and the one used for Remarketing Ads are different.
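
The three responsibilities of generateBid() can be sketched as follows. The parameter list, the metadata fields (biddingEmbedding, valuePerInstall), the userFeatures input, and the inferUserEmbedding() stand-in for on-platform inference are all assumptions; the bid formula (predicted install rate times value per install) is one common choice, not something the proposal prescribes.

```javascript
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// Stand-in for on-platform inference; a real model would run here.
function inferUserEmbedding(features) {
  return features;
}

function generateBid(candidateAds, auctionSpecificSignals, additionalSignals) {
  // Featurization: derive model inputs from the available signals.
  const features = additionalSignals.userFeatures;
  // Inference: obtain the private user embedding.
  const userEmbedding = inferUserEmbedding(features);

  let best = null;
  for (const ad of candidateAds) {
    const pInstall = sigmoid(dot(userEmbedding, ad.metadata.biddingEmbedding));
    // Bidding: expected value of showing this ad.
    const bid = pInstall * ad.metadata.valuePerInstall;
    if (best === null || bid > best.bid) best = { renderUrl: ad.renderUrl, bid };
  }
  return best; // null when there are no candidates
}
```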

The following sections describe featurization, inference, and bidding in more detail.

Featurization

Auction signals can be prepared by generateBid() into features. These features can be used during inference to predict things like click-through and conversion rates. We are also exploring privacy-preserving mechanisms to transmit some of them in the win report for use in model training.

Inference

While calculating a bid, it is common to perform inference against one or more machine learning models. For example, eCPM calculations often use models to predict click-through and conversion rates.

Clients can supply a number of machine learning models along with their generateBid() implementation. We will also provide a JavaScript API within generateBid() so clients can perform inference at runtime.

Inference executes on the buyer's bidding service. This can affect inference latency and cost, especially since accelerators are not yet available in TEEs. Some clients will find their needs are met with individual models running on the buyer's bidding service. Some clients – for example, those with very large models – may want to consider options like model factorization to minimize inference cost and latency at bid time.

More information about inference capabilities like supported model formats and maximum sizes will be provided in a future update.

Implement model factorization

Earlier we explained model factorization. At bidding time, the specific approach is:

  1. Break the single model up into a private piece (the user data) and one or more non-private pieces (the contextual and ad data).
  2. Pass non-private pieces to generateBid(). These can come either from per_buyer_signals, or from embeddings that ad techs calculate externally, materialize into the retrieval service's key-value store, fetch at retrieval time, and return as part of the additional signals. This does not include private embeddings since those cannot be sourced from outside the privacy boundary.
  3. In generateBid():
    1. Perform inference against models to get private user embeddings.
    2. Combine private user embeddings with contextual embeddings from per_buyer_signals or non-private ad and contextual embeddings from the retrieval service using an operation like a dot product. This is the final prediction that can be used to calculate bids.

Using this approach, it is possible to perform inference at bidding time against models that would otherwise be too large or slow to execute on the buyer's bidding service.

Sell-side scoring logic

At this stage, the ads with bids received from all participating buyers are scored. The output of generateBid() is passed to the seller's auction service to run scoreAd(), which considers only one ad at a time. Based on the scoring, the seller chooses a winning ad to return to the device for rendering.

The scoring logic is the same as that used for the Protected Audience remarketing flow and is able to select a winner amongst remarketing and app install candidates. The function gets called once for each submitted candidate ad in the Protected Auction. See the Bidding and Auction explainer for details.
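
The per-candidate nature of scoring can be sketched as below. The scoreAd() shown here is a simplified stand-in with a made-up signature and a made-up blocklist rule; the real signature is defined in the Bidding and Auction documentation. The selectWinner() loop is only a model of what the seller's auction service does around it.

```javascript
// Simplified stand-in for the seller's scoreAd(); sees one ad at a time.
function scoreAd(candidate, sellerSignals) {
  if (sellerSignals.blockedRenderUrls.includes(candidate.renderUrl)) return 0;
  return candidate.bid;
}

// Model of the auction service: score each candidate once, keep the best.
function selectWinner(candidates, sellerSignals) {
  let winner = null;
  for (const candidate of candidates) {
    const score = scoreAd(candidate, sellerSignals);
    if (score > 0 && (winner === null || score > winner.score)) {
      winner = { ...candidate, score };
    }
  }
  return winner; // null if every candidate was rejected
}
```

Since scoreAd() only sees a render URL and a bid in this sketch, it treats remarketing and app install candidates identically, which matches the idea that one scoring function can pick a winner across both kinds of ads.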

Ad selection code runtime

In the proposal, the ad selection code for app install is specified in the same way as for the Protected Audience remarketing flow. For details, see the Bidding and Auction configuration. The bidding code will be available in the same cloud storage location as the one used for remarketing.

Reporting

This proposal uses the same reporting APIs as the Protected Audience reporting proposal (for example, reportImpression(), which triggers the platform to send seller and buyer reports).

One common use case for reporting on the buy-side is getting the training data for models used during ad selection. In addition to existing APIs, the platform will provide a specific mechanism for egressing event-level data from the platform to ad tech servers. These egress payloads can include certain user data.

In the long term, we are investigating privacy-preserving solutions to address model training with data used in Protected Auctions without sending event-level user data outside services running on TEEs. We will provide additional details in a later update.

In the short term, we will provide a temporary way to egress noised data from generateBid(). Our initial proposal for this is below, and we will evolve it (including possible backward-incompatible changes) in response to industry feedback.

Technically, the way this works is:

  1. Ad techs define a schema for the data they want to transmit.
  2. In generateBid(), they build their desired egress payload.
  3. The platform validates the egress payload against the schema and enforces size limits.
  4. The platform adds noise to the egress payload.
  5. The egress payload is included in the win report in wire format, received on ad tech servers, decoded, and used for model training.

Defining the schema of egress payloads

For the platform to enforce evolving privacy requirements, egress payloads must be structured in a way the platform can understand. Ad techs will define the structure of their egress payloads by providing a schema JSON file. That schema will be processed by the platform, and will be kept confidential by the Bidding and Auction services using the same mechanisms as other ad tech resources like UDFs and models.

We will provide a CDDL file that defines the structure of that JSON. The CDDL file will include a set of supported feature types (for example, features that are booleans, integers, buckets, and so on). Both the CDDL file and the provided schema will be versioned.

For example, an egress payload that consists of a single boolean feature followed by a bucket feature of size two would look something like:

egressPayload = {
  features : [
    {
      type: "boolean_feature",
      value: false
    },
    {
      type: "bucket_feature",
      size: 2,
      value: [
        {
          value: false
        },
        {
          value: true
        }
      ]
    }
  ]
}

Details on the set of supported feature types are available on GitHub.

Build egress payloads in generateBid()

All Protected App Signals for a given buyer are available to their generateBid() UDF. Once these are featurized, ad techs create their payload in JSON format. This egress payload will be included in the buyer's win report for transmission to ad tech servers.
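
As a rough illustration, the payload from the earlier example could be assembled inside generateBid() with a few small helpers. This is a minimal sketch, not platform code: booleanFeature, bucketFeature, buildEgressPayload, and the fields on the signals object (hasInstalledFoodApp, daysSinceLastOpen) are hypothetical names, and the sketch assumes the signals have already been decoded and featurized. Only the payload shape is taken from the example above.

```javascript
// Hypothetical helpers; none of these names are part of the platform API.

// Builds a boolean feature in the shape shown in the example payload.
function booleanFeature(value) {
  return { type: "boolean_feature", value: Boolean(value) };
}

// Builds a bucket feature of the given size with a single bucket set to true.
// Setting exactly one bucket mirrors the example; the source does not state
// whether the platform requires that.
function bucketFeature(index, size) {
  const value = [];
  for (let i = 0; i < size; i++) {
    value.push({ value: i === index });
  }
  return { type: "bucket_feature", size, value };
}

// `signals` is an assumed, already-featurized object derived from the
// buyer's Protected App Signals. Its field names are illustrative only.
function buildEgressPayload(signals) {
  return {
    features: [
      booleanFeature(signals.hasInstalledFoodApp),
      bucketFeature(signals.daysSinceLastOpen < 7 ? 0 : 1, 2),
    ],
  };
}
```

With hasInstalledFoodApp set to false and daysSinceLastOpen set to 30, buildEgressPayload() produces the same structure as the example payload.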

An alternative to this design is for egress vector calculation to happen in reportWin() instead of generateBid(). There are trade-offs for each approach, and we'll finalize this decision in response to industry feedback.

Validate the egress payload

The platform will validate any egress payload created by the ad tech. Validation ensures that feature values are valid for their types, size constraints are met, and that malicious actors are not attempting to defeat privacy controls by packing additional information into their egress payloads.

If an egress payload is invalid, it will be silently discarded from the inputs sent to the win report. This is because we don't want to provide debugging information to any bad actor attempting to defeat validation.

We will provide a JavaScript API for ad techs to ensure the egress payload they create in generateBid() will pass platform validation:

validate(payload, schema)

This JavaScript API is entirely for callers to determine if a particular payload will pass platform validation. Actual validation must be done in the platform to guard against malicious generateBid() UDFs.
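
Until the validate() API and the schema format are published, an ad tech might run a structural self-check of its own before returning from generateBid(). The sketch below is a hypothetical local check, not the platform's validation logic: isValidFeature and looksLikeValidPayload are illustrative names, and the rules cover only the two feature types that appear in the example payload. It does not read a schema or enforce size limits, because neither format is specified here.

```javascript
// Hypothetical local pre-check; NOT the platform's validate(payload, schema).
// It only checks the payload shape used in this document's example.

function isValidFeature(feature) {
  if (feature === null || typeof feature !== "object") return false;
  switch (feature.type) {
    case "boolean_feature":
      return typeof feature.value === "boolean";
    case "bucket_feature":
      // Assumed rule: the declared size must match the number of buckets,
      // and every bucket must wrap a boolean.
      return (
        Number.isInteger(feature.size) &&
        feature.size > 0 &&
        Array.isArray(feature.value) &&
        feature.value.length === feature.size &&
        feature.value.every(
          (b) => b !== null && typeof b === "object" && typeof b.value === "boolean"
        )
      );
    default:
      // Unknown feature types are rejected in this sketch.
      return false;
  }
}

function looksLikeValidPayload(payload) {
  return (
    payload !== null &&
    typeof payload === "object" &&
    Array.isArray(payload.features) &&
    payload.features.every(isValidFeature)
  );
}
```

A check like this helps only with debugging during development. It gives no guarantee that the platform will accept the payload.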

Noise the egress payload

The platform will noise egress payloads before including them in the win report. The initial noise threshold will be 1%, and this value may evolve over time. The platform will provide no indication whether or not a particular egress payload has been noised.

The noising method is:

  1. The platform loads the schema definition for the egress payload.
  2. 1% of egress payloads will be chosen for noising.
  3. If an egress payload is not chosen, the entire original value is retained.
  4. If an egress payload is chosen, each feature's value will be replaced with a random valid value for that feature type (for example, either 0 or 1 for a boolean feature).
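
The noising steps above can be sketched as follows. This is an illustrative approximation rather than the platform's implementation: noisePayload and randomValueFor are hypothetical names, only the two feature types from the example payload are handled, and the sketch assumes that any vector of booleans is a "valid value" for a bucket feature. The random source is passed in as a parameter so the behavior can be exercised deterministically.

```javascript
// Hypothetical sketch of the noising method; not the platform's code.

// Replaces a feature's value with a random valid value for its type.
function randomValueFor(feature, random) {
  switch (feature.type) {
    case "boolean_feature":
      return { ...feature, value: random() < 0.5 };
    case "bucket_feature":
      // Assumption: each bucket is independently randomized.
      return {
        ...feature,
        value: feature.value.map(() => ({ value: random() < 0.5 })),
      };
    default:
      throw new Error(`Unsupported feature type: ${feature.type}`);
  }
}

// With probability noiseRate (1% by default) every feature is replaced;
// otherwise the original payload is returned untouched.
function noisePayload(payload, noiseRate = 0.01, random = Math.random) {
  if (random() >= noiseRate) {
    return payload;
  }
  return { features: payload.features.map((f) => randomValueFor(f, random)) };
}
```

Because a chosen payload has all of its features replaced at once, the receiver cannot tell from a single report whether it was noised. This matches the statement that the platform gives no indication of noising.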

Transmitting, receiving, and decoding the egress payload for model training

The validated, noised egress payload will be included in the arguments to reportWin(), and transmitted to buyer ad tech servers outside the privacy boundary for model training. The egress payload will be in its wire format.

Details on the wire format for all feature types and for the egress payload itself are available on GitHub.

Determine the size of the egress payload

The size of the egress payload in bits balances utility and data minimization. We will work with industry to determine the appropriate size via experimentation. While we are running those experiments, we will temporarily egress data with no bit size limitation. That additional egress data with no bit size limitation will be removed once experiments are complete.

The method for determining size is:

  1. Initially, we will support two egress payloads in generateBid():
    1. egressPayload: the size-limited egress payload we've described so far in this document. Initially, this egress payload's size will be 0 bits (meaning it will always be removed during validation).
    2. temporaryUnlimitedEgressPayload: a temporary size-unlimited egress payload for size experiments. The formatting, creation, and processing of this egress payload uses the same mechanisms as egressPayload.
  2. Each of these egress payloads will have its own schema JSON file: egress_payload_schema.json and temporary_egress_payload_schema.json.
  3. We provide an experiment protocol and set of metrics for determining model utility at various egress payload sizes (for example, 5, 10, ... bits).
  4. Based on experiment results, we determine the egress payload size with the correct utility and privacy trade-offs.
  5. We set the size of egressPayload from 0 bits to the new size.
  6. After a set migration period, we remove temporaryUnlimitedEgressPayload, leaving only egressPayload with its new size.

We are investigating additional technical guardrails for managing this change (for example, encrypting egressPayload when we increase its size from 0 bits). Those details -- along with timing for the experiment and the removal of temporaryUnlimitedEgressPayload -- will be included in a later update.

Next we'll explain a possible experiment protocol for finalizing the size of egressPayload. Our goal is to work with industry to find a size that balances utility and data minimization. The artifact these experiments will produce is a graph where the x-axis is the size of the training payload in bits, and the y-axis is the percentage of revenue generated by a model at that size compared to a size-unlimited baseline.

We'll assume we're training a pInstall model, and our sources of training data are our logs and the contents of the temporaryUnlimitedEgressPayload payloads we receive when we win auctions. The protocol for ad techs first involves offline experiments:

  1. Determine the architecture of the models they will use with Protected App Signals. For example, they will need to determine whether or not they will use model factorization.
  2. Define a metric for measuring model quality. Suggested metrics are AUC loss and log loss.
  3. Define the set of features they will use during model training.
  4. Using that model architecture, quality metric, and set of training features, run ablation studies to determine the utility contributed per bit for each model they want to use in PAS. The suggested protocol for the ablation study is:
    1. Train the model with all features and measure utility; this is the baseline for comparison.
    2. For each feature used to produce the baseline, train the model with all features except that feature.
    3. Measure resulting utility. Divide the delta by the size of the feature in bits; this is the expected utility per bit for that feature.
  5. Determine the training payload sizes of interest for experimentation. We suggest [5, 10, 15, ..., size_of_all_training_features_for_baseline] bits. Each of these represents a possible size for egressPayload that the experiment will evaluate.
  6. For each possible size, select a set of features less than or equal to that size that maximize utility per bit, using the results of the ablation study.
  7. Train a model for each possible size and evaluate its utility as a percentage of the utility of the baseline model trained on all features.
  8. Plot the results on a graph where the x-axis is the size of the training payload in bits, and the y-axis is the percentage of revenue generated by that model compared to baseline.
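
Steps 4 and 6 of the offline protocol can be illustrated with a short sketch. It assumes a utility metric where higher is better (for a loss metric such as log loss, the sign of the delta would flip), and it uses a simple greedy pass to pick features under a bit budget. Greedy selection is a heuristic chosen here for brevity, not a method prescribed by this proposal. The function names and the input shape ({ name, bits, utilityWithout }) are hypothetical.

```javascript
// Hypothetical sketch of the ablation bookkeeping; names are illustrative.

// Step 4: utility per bit for each feature.
// `ablations` holds, per feature, its size in bits and the utility measured
// when the model is trained with all features except that one.
function utilityPerBit(baselineUtility, ablations) {
  return ablations.map(({ name, bits, utilityWithout }) => ({
    name,
    bits,
    utilityPerBit: (baselineUtility - utilityWithout) / bits,
  }));
}

// Step 6: choose features for one candidate payload size.
// Greedy heuristic: take features in descending utility per bit while they
// still fit in the remaining budget.
function selectFeatures(scored, budgetBits) {
  const sorted = [...scored].sort((a, b) => b.utilityPerBit - a.utilityPerBit);
  const chosen = [];
  let usedBits = 0;
  for (const feature of sorted) {
    if (usedBits + feature.bits <= budgetBits) {
      chosen.push(feature.name);
      usedBits += feature.bits;
    }
  }
  return { chosen, usedBits };
}
```

Running selectFeatures() once per candidate size from step 5 yields the feature sets to train in step 7. The greedy pass can miss the best combination when feature sizes vary widely, so an exhaustive or dynamic-programming search may be preferable for small feature sets.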

Next, ad techs can repeat steps 5-8 in live traffic experiments, using feature data sent via temporaryUnlimitedEgressPayload. Ad techs can choose to share the results of their offline and live traffic experiments with Privacy Sandbox to inform the decision about the size of egressPayload.

The timeline for these experiments, as well as the timeline for setting the size of egressPayload to the resulting value, is beyond the scope of this document and will come in a later update.

Data protection measures

We will apply a number of protections to egressed data, including:

  1. Both egressPayload and temporaryUnlimitedEgressPayload will be noised.
  2. For data minimization and protection, temporaryUnlimitedEgressPayload will be available only for the duration of the size experiments, during which we will determine the correct size for egressPayload.

Permissions

User control

  • The proposal intends to give users visibility into the list of installed apps that have stored at least one Protected App Signal or a custom audience.
  • Users can block and remove apps from this list. Blocking and removing an app does the following:
    • Clears all Protected App Signals and custom audiences associated with the app.
    • Prevents the app from storing new Protected App Signals and custom audiences.
  • Users have the ability to reset the Protected App Signals and Protected Audience API completely. When this happens, any existing Protected App Signals and custom audiences on the device are cleared.
  • Users have the ability to opt out completely from the Privacy Sandbox on Android, which includes the Protected App Signals API and Protected Audience API. When this is the case, the Protected Audience and Protected App Signals APIs return a standard exception message: SECURITY_EXCEPTION.

App permissions and control

The proposal intends to give apps control over their Protected App Signals:

  • An app can manage its associations with Protected App Signals.
  • An app can grant third-party ad tech platforms permission to manage Protected App Signals on its behalf.

Ad tech platform control

This proposal outlines ways for ad techs to control their Protected App Signals:

  • All ad techs must enroll with the Privacy Sandbox and provide a "site" or "origin" domain which matches all URLs for Protected App Signals.
  • Ad techs can partner with apps or SDKs to provide verification tokens that are used to verify creation of Protected App Signals. When this process is delegated to a partner, Protected App Signal creation can be configured to require acknowledgement by the ad tech.