Noise injection is a technique used to protect user privacy when querying a
database. It works by adding random noise to an aggregating SELECT
clause of a
query. This noise protects user privacy while providing reasonably accurate
results, eliminating the need for difference checks, and reducing the required
aggregation threshold for output. Most existing queries can be executed in noise
mode, with some limitations.
Learn the benefits of using noise injection
Difference checks do not apply: When running queries with noise injection, Ads Data Hub does not filter rows due to similarity to previous result sets. This means that you can still get a holistic view of the data while protecting user privacy.
Troubleshooting is simplified: Rows are only omitted due to aggregation requirements, making it simpler to troubleshoot and adapt queries.
There is no new syntax to learn: You don't need to learn any new query syntax or be versed in privacy concepts to use noise instead of difference checks.
Result accuracy is clear: A successful job shows the total percentage of data with the expected amount of noise.
Learn how noise impacts privacy requirements
Difference checks: Noise injection does not rely on existing difference checks in Ads Data Hub. When you use noise injection, difference checks are disabled.
Aggregation requirement: Noise injection outputs impression data represented by approximately 20 or more unique users, and click or conversion data represented by approximately 10 or more unique users.
Static checks: No impact.
Budgets and query limits: Queries executed using noise share the data access budget used with difference checks. As with difference checks, if you execute the same query on the same dataset many times, you might lose access to frequently queried dates in the dataset. This can happen if you run sliding window queries, or if you make the same request multiple times.
Noise mode imposes additional, stricter limits on recomputing the same aggregate results within or across queries. As with the data access budget, you can lose access to frequently queried dates in the dataset; but limitations due to recomputing the same aggregate results will only restrict queries in noise mode, not queries in difference check mode. For more information, see Repeated results.
Learn more about privacy checks.
Understand how noise injection affects results
Ads Data Hub injects noise to mitigate disclosure risk—the risk that someone can learn information about an individual user. It balances privacy against utility.
Noise injection in Ads Data Hub transforms the query results as follows:
- It clamps outlier users' contributions in aggregate results. It sums each user's contribution in each aggregation and then caps each contribution with minimum and maximum clamping bounds.
- It aggregates the clamped per-user contributions.
- It adds noise to each aggregate result—the result of each aggregation function call in each row. The scale of this random noise is proportional to the clamped bounds.
- It computes a noisy user count for each row and eliminates rows with too few users. This is similar to k-anonymity in difference check mode, but because of the noise, jobs running on the same dataset can drop different rows. Also, noise mode drops fewer rows because the aggregation requirement is lower (approximately 20 versus exactly 50).
The final result is a dataset where each row has noisy aggregate results and small groups have been eliminated. This masks an individual user's effect on returned results.
About aggregation clamping
Noise injection in Ads Data Hub uses implicit or explicit aggregation clamping to limit the contribution of outliers. You can choose which type of clamping to use, depending on your use case.
Implicit clamping
In implicit clamping, bounds are determined automatically. You don't need any special SQL syntax to use implicit clamping. If one row has a wider range of values than another, implicit bounding finds different bounds for these rows. This typically gives a lower margin of error for each result. On the other hand, each aggregation gets different clamping bounds and noise levels, which can make them hard to compare.
Implicit clamping can fail when an aggregation gets data from too few users—for
example, a COUNTIF()
call with a rare condition. These cases return NULL
results. Also note that COUNT(DISTINCT user_id)
automatically uses explicit
clamping with bounds of 0
and 1
.
Explicit clamping
Explicit clamping clamps the total contribution from each user to a specified range. Explicit bounds are uniformly applied to all rows and must be literal values. Even if some rows have a wider range of per-user contributions than others, the same bounds are applied to all of them. This makes results from different rows more comparable, though some rows get more noise than they might with implicit clamping.
Explicit clamping uses half as much noise as implicit clamping, for a given set of clamping bounds. Therefore, if you can estimate reasonable bounds, you will get better results by setting them explicitly.
To use explicit clamping, set the bounds for each supported aggregate function by adding integers representing the lower bound and the upper bound. For example:
SELECT
campaign_name,
-- Set lower and upper bounds to 0 and 1, respectively
ADH.ANON_COUNT(*, contribution_bounds_per_group => (0,1))
FROM data
GROUP BY 1
Run a query using noise injection
- Open a report.
- Click the Privacy noise settings toggle to the Use noise position.
- Run the query.
- Review the impact of the added noise.
- Optional: Adapt the query to reduce noise impact.
Review noise impact
Once job completes successfully, Ads Data Hub displays the reliability of the result in the privacy summary. Reliability is based on the percentage of cells in the output that are highly impacted by noise. A value in the result table is considered highly impacted if the scale of the added noise is greater than 5% of the result in the cell.
For impacted output data sets, the privacy summary lists the ten noisiest columns from highest to lowest impact and their corresponding contribution to noise. This is the breakdown of the expected amount of noise.
Results with >5% noise | Indicator color | Impact |
---|---|---|
<5% | Green | Low impact |
5%-15% | Yellow | Medium impact |
15%-25% | Orange | High impact |
>25% | Red | Very high impact |
You can also preview the privacy summary for recent report jobs on the Home page. To preview the privacy for a particular job, hold the pointer over the privacy tip icon privacy_tip in the job card under Recent activity.
Adapt queries
Aggregate results are more likely to have an unexpected amount of noise when few
users contribute to those results. This can happen when rows have few users or
when some users don't affect results—for example, when using the COUNTIF
function. Based on the noise details, you may want to adjust your query to
increase the percentage of data with the expected amount of noise.
The following are general guidelines:
- Expand the date range.
- Rewrite the query to reduce the granularity of the data, such as by grouping
by fewer parameters or replacing
COUNTIF
withCOUNT
. - Remove noisy columns.
- Use explicit clamping.
Supported aggregate functions
The following aggregate functions are supported with noise:
SUM(...)
COUNT(*)
COUNT(...)
COUNTIF(...)
COUNT(DISTINCT user_id)
APPROX_COUNT_DISTINCT(user_id)
AVG(...)
The DISTINCT
keyword is only supported with the COUNT
function, and only
when used with a direct reference to the user_id
column from an
Ads Data Hub table or an expression that returns either user_id
or
NULL
, such as COUNT(DISTINCT IF(..., user_id, NULL))
.
The following functions are not directly supported, but can be replaced with other aggregates with noise to get statistical results. Note that the numeric values are examples only:
LOGICAL_OR(...)
. Suggested replacement:COUNT(DISTINCT IF(..., user_id, NULL)) > 0
LOGICAL_AND(...)
. Suggested replacement:COUNT(DISTINCT IF(NOT ..., user_id, NULL)) <= 0
About integer results
Although Ads Data Hub will automatically inject noise for these
aggregation functions, the function signatures do not change. Because functions
like COUNT
or SUM
of INT64
return INT64
, any decimal part of the noised
result is rounded. This is usually negligible relative to the size of the result
and noise.
If you need the granularity of the decimal in your result, then avoid writing
functions that return INT64
–for example, by using SUM
with its input cast to
FLOAT64
.
Supported query patterns
Important: Most of Ads Data Hub's standard best practices still apply to queries that use noise injection. In particular, we recommend that you review the guidance on repeatedly querying the same data.
This section describes query patterns that are supported when running queries using noise injection.
User-level aggregates
Unrestricted user-level aggregates are supported in the same way that they are
in difference check mode. Noise is only injected in aggregations that combine
data across multiple users. Aggregations that explicitly group by user_id
, or
analytic functions that partition by user_id
, don't receive any noise and any
function is allowed. User-level aggregations that don't explicitly group by
user_id
–for example, GROUP BY impression_id
, are treated as cross-user
aggregations, so noise is added.
Grouping by external_cookie is not enough. While external_cookie can be used to join *_match tables with customer-owned tables, any single-user aggregations should explicitly group by user_id column, not just the external_cookie column.
Aggregate function example:
WITH user_paths AS (
# Grouping by user_id, no noise needed, all functions allowed
SELECT user_id, STRING_AGG(campaign_id, ">" ORDER BY query_id.time_usec) AS path
FROM adh.google_ads_impressions
GROUP BY 1
)
# Noise applied here to num_users
SELECT path, COUNT(*) AS num_users
FROM user_paths
GROUP BY 1;
Analytic function example:
WITH events AS (
# Partitioning by user_id, no noise needed, all functions allowed
SELECT
campaign_id,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY query_id.time_usec) AS index
FROM adh.google_ads_impressions
)
# Noise applied here to first_impressions
SELECT campaign_id, COUNT(*) AS first_impressions
FROM events
WHERE index = 1
GROUP BY 1;
Parallel aggregates
Each cross-user aggregation receives noise independently. You can run multiple
such aggregations in a single statement, combining results into one table using
a JOIN
or UNION
.
Example:
WITH result_1 AS (
# Noise applied here to num_impressions
SELECT campaign_id, COUNT(*) AS num_impressions
FROM adh.google_ads_impressions
GROUP BY 1
), result_2 AS (
# Noise applied here to num_clicks
SELECT campaign_id, COUNT(*) AS num_clicks
FROM adh.google_ads_clicks
GROUP BY 1
)
SELECT * FROM result_1 JOIN result_2 USING(campaign_id)
Note that this would be supported but should be avoided in difference check mode. This practice is not a problem with noise, as each parallel aggregate is noised and filtered independently.
Aggregated data joined with unaggregated data
Since Ads Data Hub only supports analytic windows that partition by
user_id
, it is a common workaround to aggregate these results separately and
self-join them before aggregating again. These queries are supported in noise
mode, and often perform better than they would with in difference check mode due
to privacy requirements being resolved earlier.
Example:
WITH campaign_totals AS (
# Noise applied here to campaign_imps
SELECT campaign_id, COUNT(*) AS campaign_imps
FROM adh.google_ads_impressions
GROUP BY 1
)
# Noise applied here to imps
SELECT campaign_id, demographics, campaign_imps, COUNT(*) AS imps
FROM adh.google_ads_impressions JOIN campaign_totals USING(campaign_id)
GROUP BY 1,2,3
Noise mode prohibits reaggregating aggregate results,
such as AVG(campaign_imps)
.
Unsupported query patterns
This section describes query patterns that are not supported when running queries using noise injection.
Today-inclusive queries
Noise mode queries do not support querying the current day's data. (This is discouraged in difference check mode.) The current date is not selectable for queries that use noise injection.
Repeated results
In noise mode, Ads Data Hub limits how often you can repeat the same aggregation. If you reach these limits, your noise mode queries will lose access to frequently queried dates in the dataset. The following are examples of how this can occur.
Query repetition happens when the same query is run multiple times with the same parameters or highly similar parameters, such as overlapping date ranges. You can avoid this by using data that is already exported to your BigQuery project.
Note that if two jobs are querying overlapping date ranges, they might produce repetitions if performing the same computation on the same users. For example, the following query, executed on overlapping date ranges, creates repetitions because it partitions by date:
SELECT DATE(TIMESTAMP_MICROS(event.event_time)) AS date,
COUNT(*) AS cnt
FROM adh.cm_dt_clicks
GROUP BY 1
In this case, you should run the query on disjoint date segments.
Another example of a repetition happens when data is somewhat date independent. The following query produces repetitions when executed on overlapping dates, where both jobs cover the entire lifetime of a campaign:
SELECT campaign_id, COUNT(*) AS cnt
FROM adh.google_ads_impressions
GROUP BY 1
In this case, you should run this query only once since the result doesn't change.
Aggregation repetition happens when the same aggregation is repeated multiple times within a query:
SELECT COUNT(*) AS cnt1, COUNT(*) AS cnt2
FROM table
In this case, you should remove one of the repetitions.
Note that even if the aggregations are syntactically different but compute the
same value, it would count as a repetition. In other words, if the values of
condition1
and condition2
are the same for all users with some value of
key
, the following query would have a repetition:
SELECT key, COUNTIF(condition1) AS cnt1, COUNTIF(condition2) AS cnt2
FROM table
GROUP BY key
If you have conditions that are very similar for some groups of users, you might
consider rewriting the query to have only one COUNT
.
Row duplication happens when an Ads Data Hub table is joined with a BigQuery
table in a way that each row from the Ads Data Hub table matches multiple rows
in the BigQuery table. For example, the following query produces a repetition if
there are multiple rows with the same campaign ID in bq_table
:
SELECT r.campaign_id, COUNT(*) AS cnt
FROM adh_table
INNER JOIN bq_table ON l.campaign_id = r.campaign_id
In this case, you should restructure the query so that bq_table
would have
only one row per join key value (campaign_id
, in this case).
Note that unnesting an array from the Ads Data Hub table could produce the same effect if most users have the same arrays of values:
SELECT in_market_id, COUNT(*)
FROM adh.dv360_youtube_impressions,
UNNEST(in_market) AS in_market_id
GROUP BY 1
Learn about other query best practices.
Direct reaggregation
Noise is applied to the first layer of cross-user aggregation in the query. Queries with multiple layers of aggregation are blocked:
WITH layer_1 AS (
# Noise applied here to partial_result
SELECT campaign_id, demographics, location, COUNT(*) AS partial_result
FROM adh.google_ads_impressions
GROUP BY 1,2,3
HAVING partial_result > 5
)
# Reaggregation of partial_result with no user-level data, will be rejected
SELECT campaign_id, SUM(partial_result) AS final_result
FROM layer_1
GROUP BY 1
To get the best results from noise, compute all cross-user operations within a
single aggregation. For example, take a SUM
of events rather than a SUM
of
intermediate counts. It is possible to rewrite a query to reaggregate noised
aggregates, but the final aggregates may have much higher noise.
If this is unavoidable, you can rewrite your query to export results directly
from the first layer instead. To do this within a single job without changing
script results, create a temp table (or a table exported to your BigQuery
project) with the OPTIONS(privacy_checked_export=true)
syntax. For example:
CREATE TEMP TABLE layer_1 OPTIONS(privacy_checked_export=true) AS (
# Noise applied here to partial_result
SELECT campaign_id, demographics, location, COUNT(*) AS partial_result
FROM adh.google_ads_impressions
GROUP BY 1,2,3
HAVING partial_result > 5
);
# Reaggregation of privacy checked data, no noise needed
SELECT campaign_id, SUM(partial_result) AS final_result
FROM layer_1
GROUP BY 1
If the first layer of aggregation is too granular for privacy checks, consider rewriting the query with user-level aggregates. If this is not possible, then this query is not supported in noise mode.
Unjoined user IDs
Queries in noise mode must not combine data from separate users into a single
row, except when performing an aggregation with noise. As a result, joins of
unaggregated Ads Data Hub data are required to explicitly join on the
user_id
column.
This query does not explicitly join on the user_id
column, which results in a
validation error:
SELECT …
FROM adh.google_ads_impressions
JOIN adh.google_ads_clicks USING(impression_id)
This can be fixed by adjusting the USING
clause to explicitly include
user_id
–for example, USING(impression_id, user_id)
.
Note that this limitation applies only to joins between Ads Data Hub tables (with the exception of dimension tables). It does not apply to customer-owned tables. For example, the following is allowed:
SELECT …
FROM adh.google_ads_impressions
JOIN bigquery_project.dataset.table USING(any_column)
Ads Data Hub-BigQuery unions
Aggregations with noise require user identifiers to perform well. Customer-owned data in BigQuery has no user identifiers, so it cannot be unioned into a noise aggregation without joining to an Ads Data Hub table.
This query results in a validation error:
SELECT COUNT(*) FROM (
SELECT 1 FROM adh.google_ads_impressions
UNION ALL
SELECT 1 FROM bigquery_project.dataset.table
)
To fix this, you should either join the BigQuery table to augment Ads Data Hub data rather than union, or separate the data to aggregate each source separately.
Note that it is fine to take a union between multiple Ads Data Hub tables with user data, or multiple customer-owned BigQuery tables, but you cannot mix the two.
Ads Data Hub-BigQuery right joins
Outer joins with customer-owned data can lead to rows with missing user identifiers, which prevents noise from working well.
Both of these queries result in validation errors because they allow for unmatched rows with missing user identifiers on the Ads Data Hub side:
SELECT …
FROM adh.google_ads_impressions
RIGHT JOIN bigquery_project.dataset.table USING(column)
SELECT …
FROM bigquery_project.dataset.table
LEFT JOIN adh.google_ads_impressions USING(column)
Note that either join would work if the order of the tables was reversed.
Filtered Row Summary
The filtered row summary spec is not supported in noise mode. This feature is most often unnecessary with noise due to lower filtering rates and the lack of filtering from difference checks.
If you observe significant data filtering in a noise result, then increase the aggregated data. You may perform a parallel aggregation over the full dataset to compare an estimate of the total, for example:
SELECT campaign_name, COUNT(*)
FROM data
GROUP BY 1
UNION ALL
SELECT 'Total', COUNT(*)
FROM data
GROUP BY 1
Note that the total count is independently noised and total values may not add up, but the total count is often more accurate than taking the sum of noised rows.
Cross-mode created tables
Unexported tables in Ads Data Hub can only be used with the same privacy mode where they were created. You can't create a table in normal aggregation mode and use it in noise mode, or the other way around (unless that table is exported to BigQuery first).