You can use an existing linear regression or logistic regression model with known weights to make predictions without using ML.PREDICT
, even without access to the model itself. To do so, you need a workaround that lets you use differentially private (DP) regression models inside audience activation queries in Ads Data Hub.
This step-by-step example shows how to perform simulated inference for actual linear and binary logistic regression models, and then compares the results against those of ML.PREDICT
to show the accuracy of the simulation. It also demonstrates a practical example of how to create an audience list with a binary logistic model, which could then be used to apply a conversion model to audience activation.
Overview of the example:
- Generate data
- Train the models
- Get the weights and intercepts
- Simulate the prediction
- Compare the results
Step-by-step example
1. Generate data
Create a table containing simulated data for training the models. Mark a fraction of the rows for a holdback set.
Linear regression
CREATE OR REPLACE TABLE DATASET_NAME.LIN_REG_TRAINING_SET AS
WITH
A AS (
SELECT
*
FROM
UNNEST(GENERATE_ARRAY(1, 100000)) AS row_number),
B AS (
SELECT
row_number,
RAND() AS rand_label,
RAND() AS rand_feature_1,
RAND() AS rand_feature_2,
RAND() AS rand_feature_3,
RAND() AS rand_feature_4,
RAND() AS rand_feature_5,
RAND() AS rand_feature_6,
RAND() AS rand_feature_7,
RAND() AS rand_feature_8,
RAND() AS rand_feature_9,
RAND() AS rand_feature_10
FROM
A),
C AS (
SELECT
rand_label AS label,
*
FROM
B),
D AS (
SELECT
row_number,
CAST(round(10 * label) AS INT64) AS label,
(rand_label + rand_feature_1) / 2 AS feature_1,
(rand_label + rand_feature_2) / 2 AS feature_2,
(rand_label + rand_feature_3) / 2 AS feature_3,
(rand_label + rand_feature_4) / 2 AS feature_4,
(rand_label + rand_feature_5) / 2 AS feature_5,
(rand_label + rand_feature_6) / 2 AS feature_6,
(rand_label + rand_feature_7) / 2 AS feature_7,
(rand_label + rand_feature_8) / 2 AS feature_8,
(rand_label + rand_feature_9) / 2 AS feature_9,
(rand_label + rand_feature_10) / 2 AS feature_10
FROM
C)
SELECT
label,
feature_1,
feature_2,
feature_3,
feature_4,
feature_5,
feature_6,
feature_7,
feature_8,
feature_9,
feature_10,
RAND() < 0.1 AS holdback -- Ten percent will be true.
FROM
D
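For intuition, the data construction above can be sketched outside SQL. The following Python sketch is an illustration only (not part of the Ads Data Hub workflow): it mirrors the query by deriving the label from a hidden random value and averaging that same value with fresh noise for each feature, so every feature correlates with the label without determining it.

```python
import random

def make_row():
    # Mirrors the SQL above: the label is derived from rand_label,
    # and each feature is the average of rand_label and fresh noise.
    rand_label = random.random()
    label = round(10 * rand_label)                  # integer label in 0..10
    features = [(rand_label + random.random()) / 2  # noisy, label-correlated
                for _ in range(10)]
    holdback = random.random() < 0.1                # ~10% held back
    return label, features, holdback
```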
Binary logistic regression
CREATE OR REPLACE TABLE DATASET_NAME.BIN_LOG_REG_TRAINING_SET AS
SELECT
CASE
WHEN label < 5 THEN 0
WHEN label >= 5 THEN 1
END
AS label,
* EXCEPT (label)
FROM
`DATASET_NAME.LIN_REG_TRAINING_SET`
2. Train the models
Train the regression models on the training sets.
Linear regression
CREATE OR REPLACE MODEL `DATASET_NAME.LIN_REG_MODEL` OPTIONS (model_type="linear_reg") AS
SELECT
* except (holdback)
FROM
`DATASET_NAME.LIN_REG_TRAINING_SET`
WHERE
NOT holdback
Note that we added enough noise to the simulated data to obtain a model with R2 = 0.9009.
Metric | Value |
---|---|
Mean absolute error | 0.7359 |
Mean squared error | 0.8432 |
Mean squared log error | 0.0810 |
Median absolute error | 0.6239 |
R squared | 0.9009 |
Binary logistic regression
CREATE OR REPLACE MODEL `DATASET_NAME.BIN_LOG_REG_MODEL` OPTIONS (model_type="logistic_reg") AS
SELECT
* EXCEPT (holdback)
FROM
`DATASET_NAME.BIN_LOG_REG_TRAINING_SET`
WHERE
NOT holdback
Sample results follow. Note the accuracy of 0.9260.
Metric | Value |
---|---|
Positive class | 1 |
Negative class | 0 |
Precision | 0.9341 |
Recall | 0.9315 |
Accuracy | 0.9260 |
F1 score | 0.9328 |
The bold values in this confusion matrix show how often the model classified each label correctly, while the non-bold values show how often it misclassified each label.
Actual label | Predicted label 1 | Predicted label 0 |
---|---|---|
1 | 93% | 7% |
0 | 8% | 92% |
3. Get the weights and intercepts
Get the weights and intercept of each model:
Linear regression
SELECT
*
FROM
ML.WEIGHTS(MODEL `DATASET_NAME.LIN_REG_MODEL`)
processed_input | weight |
---|---|
feature_1 | 1.8263055528635743 |
feature_2 | 1.8143804404490813 |
feature_3 | 1.8601204874033492 |
feature_4 | 1.8507603439031859 |
feature_5 | 1.7899764387123640 |
feature_6 | 1.8645246630251291 |
feature_7 | 1.8698005281925356 |
feature_8 | 1.7904637080330201 |
feature_9 | 1.8036887855406274 |
feature_10 | 1.8117115890624449 |
INTERCEPT | -4.1428754911504306 |
Binary logistic regression
SELECT
*
FROM
ML.WEIGHTS(MODEL `DATASET_NAME.BIN_LOG_REG_MODEL`)
processed_input | weight |
---|---|
feature_1 | 3.823533928 |
feature_2 | 3.734812819 |
feature_3 | 3.842239823 |
feature_4 | 3.785488823 |
feature_5 | 3.737386716 |
feature_6 | 3.567663961 |
feature_7 | 3.819643052 |
feature_8 | 3.734673763 |
feature_9 | 3.839301406 |
feature_10 | 3.787306994 |
INTERCEPT | -17.922169920 |
4. Simulate the prediction
Linear regression
Use the dot product of the feature values with the weights, plus the intercept, to make a prediction using standard SQL without using ML.PREDICT
. This query compares the predictions made this way against the predictions from ML.PREDICT
. Note how the inner subquery computes the dot product of each row's feature values with the model's weights, then adds the intercept.
WITH
T AS (
SELECT
label AS actual_label,
predicted_label AS ml_predicted_label,
[feature_1,
feature_2,
feature_3,
feature_4,
feature_5,
feature_6,
feature_7,
feature_8,
feature_9,
feature_10] AS features,
[1.8263055528635743,
1.8143804404490813,
1.8601204874033492,
1.8507603439031859,
1.789976438712364,
1.8645246630251291,
1.8698005281925356,
1.7904637080330201,
1.8036887855406274,
1.8117115890624449] AS weights
FROM
ML.PREDICT(MODEL `DATASET_NAME.LIN_REG_MODEL`,
(
SELECT
*
FROM
`PROJECT_NAME.DATASET_NAME.LIN_REG_TRAINING_SET`))
WHERE
holdback),
P AS (
SELECT
actual_label,
ml_predicted_label,
(
SELECT
SUM(element1 * element2) - 4.1428754911504306
FROM
T.features element1
WITH
OFFSET
pos
JOIN
T.weights element2
WITH
OFFSET
pos
USING
(pos) ) sql_predicted_label,
features,
weights
FROM
T)
SELECT
actual_label,
ml_predicted_label,
sql_predicted_label,
ABS(ml_predicted_label - sql_predicted_label) < 0.00000000001 AS diff_is_negligible
FROM
P
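The arithmetic in the query above is just a dot product plus an intercept. As a sanity check, here is a minimal Python sketch using the weights and intercept reported by ML.WEIGHTS above; the feature vector is a made-up example row, not data from the tables.

```python
# Weights and intercept from the ML.WEIGHTS output above.
WEIGHTS = [1.8263055528635743, 1.8143804404490813, 1.8601204874033492,
           1.8507603439031859, 1.789976438712364, 1.8645246630251291,
           1.8698005281925356, 1.7904637080330201, 1.8036887855406274,
           1.8117115890624449]
INTERCEPT = -4.1428754911504306

def predict_linear(features):
    # Same computation as the SQL: SUM(feature * weight) + intercept.
    return sum(f * w for f, w in zip(features, WEIGHTS)) + INTERCEPT

# Hypothetical row where every feature is 0.5 (features here lie in [0, 1]).
print(predict_linear([0.5] * 10))
```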
Binary logistic regression
Simulating binary logistic regression predictions is very similar to linear regression, with the addition of a final step that applies the sigmoid function with the desired threshold.
Use the dot product of the feature values with the weights, plus the intercept, to make a prediction using standard SQL without using ML.PREDICT
. Then apply the sigmoid function with a threshold of 0.5 to the result to predict 0 or 1. This query compares the predictions made this way against the predictions from ML.PREDICT
.
WITH
T AS (
SELECT
label AS actual_label,
predicted_label AS ml_predicted_label,
[feature_1,
feature_2,
feature_3,
feature_4,
feature_5,
feature_6,
feature_7,
feature_8,
feature_9,
feature_10] AS features,
[3.8235339279050287,
3.7348128191185244,
3.8422398227859471,
3.7854888232502479,
3.7373867156553713,
3.5676639605351026,
3.8196430517007811,
3.7346737628343032,
3.8393014063170749,
3.7873069939244743] AS weights
FROM
ML.PREDICT(MODEL `DATASET_NAME.BIN_LOG_REG_MODEL`,
(
SELECT
*
FROM
`PROJECT_NAME.DATASET_NAME.BIN_LOG_REG_TRAINING_SET`))
WHERE
holdback),
P AS (
SELECT
actual_label,
ml_predicted_label,
(
SELECT
IF((1 / (1 + EXP(-(SUM(element1 * element2) -17.922169920432161)))) < 0.5, 0, 1)
FROM
T.features element1
WITH
OFFSET
pos
JOIN
T.weights element2
WITH
OFFSET
pos
USING
(pos) ) sql_predicted_label,
features,
weights
FROM
T)
SELECT
actual_label,
ml_predicted_label,
sql_predicted_label,
ml_predicted_label = sql_predicted_label AS simulation_is_accurate
FROM
P
The inner subquery in the query above computes the dot product of each row's feature values with the model's weights, and adds the intercept, to get the linear prediction:
SUM(element1 * element2) - 17.922169920432161
It then applies the sigmoid function Y = 1 / (1+e^-z)
to the dot product plus intercept using standard SQL:
1 / (1 + EXP(-(SUM(element1 * element2) - 17.922169920432161)))
Finally, the result of the sigmoid function is compared to the threshold of 0.5 to arrive at the binary logistic regression prediction: 0 if it is less than 0.5, and 1 otherwise. Note that you can use any threshold value between 0 and 1.
IF((1 / (1 + EXP(-(SUM(element1 * element2) -17.922169920432161)))) < 0.5, 0, 1)
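The same three steps (dot product plus intercept, sigmoid, threshold) can be sketched in Python. This is an illustration using the logistic model's weights from above; the feature vectors are made up.

```python
import math

# Weights and intercept from the binary logistic ML.WEIGHTS output above.
WEIGHTS = [3.8235339279050287, 3.7348128191185244, 3.8422398227859471,
           3.7854888232502479, 3.7373867156553713, 3.5676639605351026,
           3.8196430517007811, 3.7346737628343032, 3.8393014063170749,
           3.7873069939244743]
INTERCEPT = -17.922169920432161

def predict_binary(features, threshold=0.5):
    z = sum(f * w for f, w in zip(features, WEIGHTS)) + INTERCEPT  # dot product + intercept
    p = 1 / (1 + math.exp(-z))                                     # sigmoid Y = 1/(1+e^-z)
    return 1 if p >= threshold else 0                              # compare to threshold
```

Any threshold in (0, 1) can be passed; 0.5 matches the query above.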
This approach also extends to multiclass logistic regression. In that case, the model's weights will be an nxn matrix rather than a vector, and the intercepts will be a vector rather than a scalar. Multiply the feature-values vector by the weights matrix and add the intercepts vector. The resulting vector will have a score for each label, and you can choose the label with the highest score as your prediction. To return an array of probabilities instead, apply the sigmoid function to each element of the array.
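A minimal sketch of the multiclass case just described, using a hypothetical 3-class, 2-feature model (the matrix, intercept vector, and inputs are invented for illustration):

```python
# Hypothetical weights: one row per class, one column per feature.
W = [[1.0, -0.5],      # class 0
     [0.2,  0.9],      # class 1
     [-0.3, 0.4]]      # class 2
B = [0.1, -0.2, 0.05]  # one intercept per class

def predict_multiclass(x):
    # Multiply the feature vector by the weights matrix and add the
    # intercepts vector: one score per class.
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(W, B)]
    # The predicted label is the class with the highest score.
    return max(range(len(scores)), key=scores.__getitem__)
```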
5. Compare the results
Linear regression
The sample results are nearly identical, apart from tiny rounding errors.
actual_label | ml_predicted_label | sql_predicted_label | diff_is_negligible |
---|---|---|---|
6 | 5.2062349420751834 | 5.2062349420751826 | true |
0 | 0.40318472770048075 | 0.403184727700479 | true |
3 | 3.0703766078249597 | 3.0703766078249597 | true |
7 | 7.0588171538562 | 7.0588171538562 | true |
6 | 6.7802375930646 | 6.7802375930646 | true |
6 | 5.1088569571339368 | 5.1088569571339377 | true |
4 | 4.051839078116874 | 4.051839078116874 | true |
4 | 5.1810254680219243 | 5.1810254680219234 | true |
6 | 6.1440349466401223 | 6.1440349466401205 | true |
1 | 2.0842399472783519 | 2.0842399472783519 | true |
2 | 2.1911209811886847 | 2.1911209811886838 | true |
3 | 3.0236086790006622 | 3.0236086790006613 | true |
2 | 2.573083132964213 | 2.5730831329642125 | true |
7 | 5.68662973136732 | 5.6866297313673186 | true |
9 | 8.1860026312677938 | 8.1860026312677938 | true |
Binary logistic regression
Comparing the simulated inference against the actual results of ML.PREDICT
shows a perfect match: not a single contradiction across the 10,000-row holdback set. In a few rows, both ML.PREDICT
and the simulated inference disagree with the actual label; that is expected, since the model's accuracy is about 93% and the off-diagonal cells of the confusion matrix hold small but nonzero values.
actual_label | ml_predicted_label | sql_predicted_label | simulation_is_accurate |
---|---|---|---|
0 | 1 | 1 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 0 | 0 | true |
0 | 1 | 1 | true |
0 | 0 | 0 | true |
Creating an audience activation list using machine learning
A typical use case is to create a differentially private binary logistic regression model to predict conversions, and then apply inference from that model when creating an audience list. Assume that the binary logistic model created in the example above models conversions, and that each row in the training and evaluation sets represents a distinct user.
The following query shows how to create an audience list of the users that the model predicts will convert:
WITH
T AS (
SELECT
*,
label AS actual_label,
[feature_1,
feature_2,
feature_3,
feature_4,
feature_5,
feature_6,
feature_7,
feature_8,
feature_9,
feature_10] AS features,
[3.8235339279050287,
3.7348128191185244,
3.8422398227859471,
3.7854888232502479,
3.7373867156553713,
3.5676639605351026,
3.8196430517007811,
3.7346737628343032,
3.8393014063170749,
3.7873069939244743] AS weights
FROM
`PROJECT_NAME.DATASET_NAME.BIN_LOG_REG_TRAINING_SET`
WHERE
holdback),
P AS (
SELECT
*,
(
SELECT
IF
((1 / (1 + EXP(-(SUM(element1 * element2) -17.922169920432161)))) < 0.5, 0, 1)
FROM
T.features element1
WITH
OFFSET
pos
JOIN
T.weights element2
WITH
OFFSET
pos
USING
(pos) ) predicted_label,
features,
weights
FROM
T)
SELECT
user_id
FROM
P
WHERE
predicted_label = 1;
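Outside SQL, the selection logic of this query looks like the following sketch. The user IDs and feature vectors are invented for illustration; in practice the filtering happens inside the Ads Data Hub query above, and the threshold of 0.5 is a choice you can tune.

```python
import math

# Weights and intercept from the binary logistic ML.WEIGHTS output above.
WEIGHTS = [3.8235339279050287, 3.7348128191185244, 3.8422398227859471,
           3.7854888232502479, 3.7373867156553713, 3.5676639605351026,
           3.8196430517007811, 3.7346737628343032, 3.8393014063170749,
           3.7873069939244743]
INTERCEPT = -17.922169920432161

def will_convert(features, threshold=0.5):
    # Dot product + intercept, squashed by the sigmoid, compared to threshold.
    z = sum(f * w for f, w in zip(features, WEIGHTS)) + INTERCEPT
    return 1 / (1 + math.exp(-z)) >= threshold

# Hypothetical users mapped to their feature vectors.
users = {
    "user_a": [0.8] * 10,  # strong signal: predicted converter
    "user_b": [0.3] * 10,  # weak signal: excluded
}
audience = [uid for uid, feats in users.items() if will_convert(feats)]
print(audience)
```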