You can use an existing linear or logistic regression model with known weights to make predictions without using ML.PREDICT, even without access to the model itself. To do this, you need a workaround for using differentially private (DP) regression models inside audience activation queries in Ads Data Hub.
This step-by-step example shows how to simulate inference for actual linear and binary logistic regression models, and then compares the results with those of ML.PREDICT to show the accuracy of the simulation. It also demonstrates how to build an audience list with a binary logistic model, which you could then use to apply a conversion model during audience activation.
Overview of the example:
- Generate data
- Train the model
- Get the weights and intercept
- Simulate the prediction
- Compare results
Step-by-step example
1. Generate data
Create a table with simulated data for training the model, and mark a fraction of the rows for the holdback set.
Linear regression
CREATE OR REPLACE TABLE DATASET_NAME.LIN_REG_TRAINING_SET AS
WITH
A AS (
SELECT
*
FROM
UNNEST(GENERATE_ARRAY(1, 100000)) AS row_number),
B AS (
SELECT
row_number,
RAND() AS rand_label,
RAND() AS rand_feature_1,
RAND() AS rand_feature_2,
RAND() AS rand_feature_3,
RAND() AS rand_feature_4,
RAND() AS rand_feature_5,
RAND() AS rand_feature_6,
RAND() AS rand_feature_7,
RAND() AS rand_feature_8,
RAND() AS rand_feature_9,
RAND() AS rand_feature_10
FROM
A),
C AS (
SELECT
rand_label AS label,
*
FROM
B),
D AS (
SELECT
row_number,
CAST(ROUND(10 * label) AS INT64) AS label,
(rand_label + rand_feature_1) / 2 AS feature_1,
(rand_label + rand_feature_2) / 2 AS feature_2,
(rand_label + rand_feature_3) / 2 AS feature_3,
(rand_label + rand_feature_4) / 2 AS feature_4,
(rand_label + rand_feature_5) / 2 AS feature_5,
(rand_label + rand_feature_6) / 2 AS feature_6,
(rand_label + rand_feature_7) / 2 AS feature_7,
(rand_label + rand_feature_8) / 2 AS feature_8,
(rand_label + rand_feature_9) / 2 AS feature_9,
(rand_label + rand_feature_10) / 2 AS feature_10
FROM
C)
SELECT
label,
feature_1,
feature_2,
feature_3,
feature_4,
feature_5,
feature_6,
feature_7,
feature_8,
feature_9,
feature_10,
RAND() < 0.1 AS holdback -- Ten percent will be true.
FROM
D
Binary logistic regression
CREATE OR REPLACE TABLE DATASET_NAME.BIN_LOG_REG_TRAINING_SET AS
SELECT
CASE
WHEN label < 5 THEN 0
WHEN label >= 5 THEN 1
END
AS label,
* EXCEPT (label)
FROM
`DATASET_NAME.LIN_REG_TRAINING_SET`
2. Train the model
Train a regression model on the training set.
Linear regression
CREATE OR REPLACE MODEL `DATASET_NAME.LIN_REG_MODEL` OPTIONS (model_type="linear_reg") AS
SELECT
* except (holdback)
FROM
`DATASET_NAME.LIN_REG_TRAINING_SET`
WHERE
NOT holdback
Note that enough noise was added to the simulated data to obtain a model with R2 = 0.9009.
Metric | Value |
---|---|
Mean absolute error | 0.7359 |
Mean squared error | 0.8432 |
Mean squared log error | 0.0810 |
Median absolute error | 0.6239 |
R squared | 0.9009 |
Binary logistic regression
CREATE OR REPLACE MODEL `DATASET_NAME.BIN_LOG_REG_MODEL` OPTIONS (model_type="logistic_reg") AS
SELECT
* EXCEPT (holdback)
FROM
`DATASET_NAME.BIN_LOG_REG_TRAINING_SET`
WHERE
NOT holdback
Example results. Note that the accuracy is 0.9260.
Metric | Value |
---|---|
Positive class | 1 |
Negative class | 0 |
Precision | 0.0810 |
Recall | 0.9315 |
Accuracy | 0.9260 |
F1 score | 0.9328 |
The bold values in this confusion matrix show how often the model classified each label correctly, and the non-bold values show how often it misclassified each label.
Actual label | Predicted label: 1 | Predicted label: 0 |
---|---|---|
1 | 93% | 7% |
0 | 8% | 92% |
3. Get the weights and intercept
Get the weights and intercept from the model:
Linear regression
SELECT
*
FROM
ML.WEIGHTS(MODEL `DATASET_NAME.LIN_REG_MODEL`)
processed_input | weight |
---|---|
feature_1 | 1.8263055528635743 |
feature_2 | 1.8143804404490813 |
feature_3 | 1.8601204874033492 |
feature_4 | 1.8507603439031859 |
feature_5 | 1.7899764387123640 |
feature_6 | 1.8645246630251291 |
feature_7 | 1.8698005281925356 |
feature_8 | 1.7904637080330201 |
feature_9 | 1.8036887855406274 |
feature_10 | 1.8117115890624449 |
Intercept | -4.1428754911504306 |
Binary logistic regression
SELECT
*
FROM
ML.WEIGHTS(MODEL `DATASET_NAME.BIN_LOG_REG_MODEL`)
processed_input | weight |
---|---|
feature_1 | 3.823533928 |
feature_2 | 3.734812819 |
feature_3 | 3.842239823 |
feature_4 | 3.785488823 |
feature_5 | 3.737386716 |
feature_6 | 3.567663961 |
feature_7 | 3.819643052 |
feature_8 | 3.734673763 |
feature_9 | 3.839301406 |
feature_10 | 3.787306994 |
Intercept | -17.922169920 |
4. Simulate the prediction
Linear regression
Use standard SQL, rather than ML.PREDICT, to make predictions: compute the dot product of each row's feature values with the model's weights and then add the intercept. The following query compares predictions made with this technique against predictions from ML.PREDICT. Note the inner subquery that computes the dot product of the row's feature values with the model weights and then adds the intercept.
WITH
T AS (
SELECT
label AS actual_label,
predicted_label AS ml_predicted_label,
[feature_1,
feature_2,
feature_3,
feature_4,
feature_5,
feature_6,
feature_7,
feature_8,
feature_9,
feature_10] AS features,
[1.8263055528635743,
1.8143804404490813,
1.8601204874033492,
1.8507603439031859,
1.789976438712364,
1.8645246630251291,
1.8698005281925356,
1.7904637080330201,
1.8036887855406274,
1.8117115890624449] AS weights
FROM
ML.PREDICT(MODEL `DATASET_NAME.LIN_REG_MODEL`,
(
SELECT
*
FROM
`PROJECT_NAME.DATASET_NAME.LIN_REG_TRAINING_SET`))
WHERE
holdback),
P AS (
SELECT
actual_label,
ml_predicted_label,
(
SELECT
SUM(element1 * element2) - 4.1428754911504306
FROM
T.features element1
WITH
OFFSET
pos
JOIN
T.weights element2
WITH
OFFSET
pos
USING
(pos) ) sql_predicted_label,
features,
weights
FROM
T)
SELECT
actual_label,
ml_predicted_label,
sql_predicted_label,
ABS(ml_predicted_label - sql_predicted_label) < 0.00000000001 AS diff_is_negligible
FROM
P
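The per-row arithmetic that the query performs can also be sketched outside SQL. Here is a minimal Python sketch using the weights and intercept from step 3; the feature values are made up for illustration and do not come from the generated table:

```python
# Simulate a linear regression prediction: dot(features, weights) + intercept.
# Weights and intercept taken from the ML.WEIGHTS output in step 3.
weights = [1.8263055528635743, 1.8143804404490813, 1.8601204874033492,
           1.8507603439031859, 1.789976438712364, 1.8645246630251291,
           1.8698005281925356, 1.7904637080330201, 1.8036887855406274,
           1.8117115890624449]
intercept = -4.1428754911504306

def predict_linear(features):
    # Same computation as the SQL subquery: SUM(element1 * element2) + intercept.
    return sum(f * w for f, w in zip(features, weights)) + intercept

# Hypothetical row of feature values (each feature lies in [0, 1]).
print(predict_linear([0.5] * 10))
```

This is the whole trick for linear regression: the prediction is just a dot product plus the intercept, which any SQL engine can compute without model access.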
Binary logistic regression
For binary logistic regression, the technique for simulating predictions is very similar to linear regression, with the additional final step of applying the sigmoid function with the chosen threshold.
Use standard SQL, rather than ML.PREDICT, to make predictions: compute the dot product of each row's feature values with the model's weights and then add the intercept. Then apply the sigmoid function with a threshold of 0.5 to the result to predict 0 or 1. The following query compares predictions made with this technique against predictions from ML.PREDICT.
WITH
T AS (
SELECT
label AS actual_label,
predicted_label AS ml_predicted_label,
[feature_1,
feature_2,
feature_3,
feature_4,
feature_5,
feature_6,
feature_7,
feature_8,
feature_9,
feature_10] AS features,
[3.8235339279050287,
3.7348128191185244,
3.8422398227859471,
3.7854888232502479,
3.7373867156553713,
3.5676639605351026,
3.8196430517007811,
3.7346737628343032,
3.8393014063170749,
3.7873069939244743] AS weights
FROM
ML.PREDICT(MODEL `DATASET_NAME.BIN_LOG_REG_MODEL`,
(
SELECT
*
FROM
`PROJECT_NAME.DATASET_NAME.BIN_LOG_REG_TRAINING_SET`))
WHERE
holdback),
P AS (
SELECT
actual_label,
ml_predicted_label,
(
SELECT
IF((1 / (1 + EXP(-(SUM(element1 * element2) -17.922169920432161)))) < 0.5, 0, 1)
FROM
T.features element1
WITH
OFFSET
pos
JOIN
T.weights element2
WITH
OFFSET
pos
USING
(pos) ) sql_predicted_label,
features,
weights
FROM
T)
SELECT
actual_label,
ml_predicted_label,
sql_predicted_label,
ml_predicted_label = sql_predicted_label AS simulation_is_accurate
FROM
P
In the query above, the part of the SQL that computes the dot product of each row's feature values with the model's weights and adds the intercept is the linear portion of the prediction:
SUM(element1 * element2) - 17.922169920432161
The sigmoid function Y = 1 / (1 + e^-z) is then applied to the dot product plus the intercept, using standard SQL:
1 / (1 + EXP(-(SUM(element1 * element2) - 17.922169920432161)))
Finally, the result of the sigmoid function is compared with the threshold of 0.5 to predict 0 (when less than 0.5) or 1 (otherwise). Note that any threshold between 0 and 1 can be used:
IF((1 / (1 + EXP(-(SUM(element1 * element2) - 17.922169920432161)))) < 0.5, 0, 1)
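The same three steps (dot product plus intercept, sigmoid, threshold), sketched in Python for illustration using the weights and intercept from step 3; the feature values are hypothetical:

```python
import math

# Weights and intercept from the binary logistic model in step 3.
weights = [3.8235339279050287, 3.7348128191185244, 3.8422398227859471,
           3.7854888232502479, 3.7373867156553713, 3.5676639605351026,
           3.8196430517007811, 3.7346737628343032, 3.8393014063170749,
           3.7873069939244743]
intercept = -17.922169920432161

def predict_binary(features, threshold=0.5):
    # Linear part: dot product of features and weights, plus the intercept.
    z = sum(f * w for f, w in zip(features, weights)) + intercept
    # Sigmoid: maps z to a probability in (0, 1).
    probability = 1 / (1 + math.exp(-z))
    # Threshold: predict 0 below the threshold, otherwise 1.
    return 0 if probability < threshold else 1

# Hypothetical rows: low feature values give a large negative z (predict 0),
# high feature values give a large positive z (predict 1).
print(predict_binary([0.2] * 10))
print(predict_binary([0.8] * 10))
```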
This technique also extends to multiclass logistic regression. In that case, the model's weights are an n×n matrix rather than a vector, and the intercept is a vector rather than a scalar. Multiply the vector of feature values by the weight matrix and add the intercept vector; the resulting vector contains a score for each label, and you can choose the label with the highest score as the prediction. To return an array of probabilities instead, apply the sigmoid function to each element of the array.
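A minimal Python sketch of that multiclass extension; the 3-class weight matrix and intercept vector below are invented for illustration and do not come from any trained model:

```python
# Multiclass scores: features (1 x n) times weight matrix (n x classes),
# plus an intercept per class. All values below are made up for illustration.
weight_matrix = [  # one row per feature, one column per class (4 features, 3 classes)
    [0.9, -0.3, 0.1],
    [0.2, 0.8, -0.5],
    [-0.4, 0.1, 0.7],
    [0.3, 0.3, 0.3],
]
intercepts = [0.1, -0.2, 0.05]

def predict_multiclass(features):
    # Score for each class: dot product of the features with that class's
    # weight column, plus that class's intercept.
    scores = [
        sum(f * row[c] for f, row in zip(features, weight_matrix)) + intercepts[c]
        for c in range(len(intercepts))
    ]
    # Choose the label with the highest score.
    return scores.index(max(scores))

print(predict_multiclass([1.0, 0.0, 0.0, 1.0]))
```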
5. Compare results
Linear regression
Except for tiny rounding errors, the example results are almost identical.
actual_label | ml_predicted_label | sql_predicted_label | diff_is_negligible |
---|---|---|---|
6 | 5.2062349420751834 | 5.2062349420751826 | Yes |
0 | 0.40318472770048075 | 0.403184727700479 | Yes |
3 | 3.0703766078249597 | 3.0703766078249597 | Yes |
7 | 7.0588171538562 | 7.0588171538562 | Yes |
6 | 6.7802375930646 | 6.7802375930646 | Yes |
6 | 5.1088569571339368 | 5.1088569571339377 | Yes |
4 | 4.051839078116874 | 4.051839078116874 | Yes |
4 | 5.1810254680219243 | 5.1810254680219234 | Yes |
6 | 6.1440349466401223 | 6.1440349466401205 | Yes |
1 | 2.0842399472783519 | 2.0842399472783519 | Yes |
2 | 2.1911209811886847 | 2.1911209811886838 | Yes |
3 | 3.0236086790006622 | 3.0236086790006613 | Yes |
2 | 2.573083132964213 | 2.5730831329642125 | Yes |
7 | 5.68662973136732 | 5.6866297313673186 | Yes |
9 | 8.1860026312677938 | 8.1860026312677938 | Yes |
Binary logistic regression
Comparing the results of the simulated inference with the actual results of ML.PREDICT is encouraging: there is not a single conflict across the 10,000-row holdback set. On a few rows, both ML.PREDICT and the simulated inference disagree with the actual label; that is expected, given the model's accuracy of about 93% and the small (but nonzero) values in the off-diagonal cells of the confusion matrix.
actual_label | ml_predicted_label | sql_predicted_label | simulation_is_accurate |
---|---|---|---|
0 | 1 | 1 | Yes |
0 | 0 | 0 | Yes |
0 | 0 | 0 | Yes |
0 | 0 | 0 | Yes |
0 | 0 | 0 | Yes |
0 | 0 | 0 | Yes |
0 | 0 | 0 | Yes |
0 | 0 | 0 | Yes |
0 | 0 | 0 | Yes |
0 | 0 | 0 | Yes |
0 | 1 | 1 | Yes |
0 | 0 | 0 | Yes |
Using ML to build an audience activation list
A common scenario is to build a differentially private binary logistic regression model that predicts conversions, and then apply inference from that model when creating an audience list. Assume that the binary logistic model created in the example above models conversions and that each row in the training and evaluation sets represents a distinct user.
The following query shows how to create an audience list containing the users that the model predicts will convert:
WITH
T AS (
SELECT
*,
label AS actual_label,
[feature_1,
feature_2,
feature_3,
feature_4,
feature_5,
feature_6,
feature_7,
feature_8,
feature_9,
feature_10] AS features,
[3.8235339279050287,
3.7348128191185244,
3.8422398227859471,
3.7854888232502479,
3.7373867156553713,
3.5676639605351026,
3.8196430517007811,
3.7346737628343032,
3.8393014063170749,
3.7873069939244743] AS weights
FROM
`PROJECT_NAME.DATASET_NAME.BIN_LOG_REG_TRAINING_SET`
WHERE
holdback),
P AS (
SELECT
*,
(
SELECT
IF
((1 / (1 + EXP(-(SUM(element1 * element2) -17.922169920432161)))) < 0.5, 0, 1)
FROM
T.features element1
WITH
OFFSET
pos
JOIN
T.weights element2
WITH
OFFSET
pos
USING
(pos) ) predicted_label,
features,
weights
FROM
T)
SELECT
user_id
FROM
P
WHERE
predicted_label = 1;