Audience list regression simulation

You can make predictions with an existing linear or logistic regression model with known weights, without using ML.PREDICT, even without access to the model itself. To do so, you need a workaround for using differentially private (DP) regression models inside audience activation queries in Ads Data Hub.

This step-by-step example shows how to simulate inference for actual linear and binary logistic regression models, then compares the results against those of ML.PREDICT to show the accuracy of the simulation. It also demonstrates a practical example of how to build an audience list with a binary logistic model, which could later be used to apply a conversion model to audience activation.

Overview of the example:

  1. Generate data
  2. Train the model
  3. Get the weights and intercept
  4. Simulate the prediction
  5. Compare the results

Step-by-step example

1. Generate data

Create a table with simulated data for training the model, and mark a fraction of the rows for the holdback set.

Linear regression

CREATE OR REPLACE TABLE DATASET_NAME.LIN_REG_TRAINING_SET AS
  WITH
  A AS (
    SELECT
      *
    FROM
      UNNEST(GENERATE_ARRAY(1, 100000)) AS row_number),
  B AS (
    SELECT
      row_number,
      RAND() AS rand_label,
      RAND() AS rand_feature_1,
      RAND() AS rand_feature_2,
      RAND() AS rand_feature_3,
      RAND() AS rand_feature_4,
      RAND() AS rand_feature_5,
      RAND() AS rand_feature_6,
      RAND() AS rand_feature_7,
      RAND() AS rand_feature_8,
      RAND() AS rand_feature_9,
      RAND() AS rand_feature_10
    FROM
      A),
  C AS (
    SELECT
      rand_label AS label,
      *
    FROM
      B),
  D AS (
    SELECT
    row_number,
    CAST(round(10 * label) AS INT64) AS label,
    (rand_label + rand_feature_1) / 2 AS feature_1,
    (rand_label + rand_feature_2) / 2 AS feature_2,
    (rand_label + rand_feature_3) / 2 AS feature_3,
    (rand_label + rand_feature_4) / 2 AS feature_4,
    (rand_label + rand_feature_5) / 2 AS feature_5,
    (rand_label + rand_feature_6) / 2 AS feature_6,
    (rand_label + rand_feature_7) / 2 AS feature_7,
    (rand_label + rand_feature_8) / 2 AS feature_8,
    (rand_label + rand_feature_9) / 2 AS feature_9,
    (rand_label + rand_feature_10) / 2 AS feature_10
    FROM
    C)

SELECT
  label,
  feature_1,
  feature_2,
  feature_3,
  feature_4,
  feature_5,
  feature_6,
  feature_7,
  feature_8,
  feature_9,
  feature_10,
  RAND() < 0.1 AS holdback -- Ten percent will be true.
FROM
  D

Binary logistic regression

CREATE OR REPLACE TABLE DATASET_NAME.BIN_LOG_REG_TRAINING_SET AS
SELECT
  CASE
    WHEN label < 5 THEN 0
    WHEN label >= 5 THEN 1
  END
  AS label,
  * EXCEPT (label)
FROM
  `DATASET_NAME.LIN_REG_TRAINING_SET`

2. Train the model

Train the regression models on the training sets.

Linear regression

CREATE OR REPLACE MODEL `DATASET_NAME.LIN_REG_MODEL` OPTIONS (model_type="linear_reg") AS
  SELECT
    * except (holdback)
  FROM
    `DATASET_NAME.LIN_REG_TRAINING_SET`
  WHERE
    NOT holdback

Note that enough noise was added to the simulated data to produce a model with R² = 0.9009.

Evaluation
Mean absolute error 0.7359
Mean squared error 0.8432
Mean squared log error 0.0810
Median absolute error 0.6239
R squared 0.9009

Binary logistic regression

CREATE OR REPLACE MODEL `DATASET_NAME.BIN_LOG_REG_MODEL` OPTIONS (model_type="logistic_reg") AS
  SELECT
    * EXCEPT (holdback)
  FROM
    `DATASET_NAME.BIN_LOG_REG_TRAINING_SET`
  WHERE
    NOT holdback

Sample results. Note the accuracy of 0.9260.

Evaluation
Positive class 1
Negative class 0
Precision 0.9341
Recall 0.9315
Accuracy 0.9260
F1 score 0.9328
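As a quick arithmetic check (a sketch, not part of the original example), the precision implied by the reported recall and F1 score can be recovered by inverting the harmonic-mean definition of F1:

```python
# Invert F1 = 2 * P * R / (P + R) to recover the precision implied by the
# reported recall and F1 score from the evaluation table.
recall = 0.9315
f1 = 0.9328

precision = f1 * recall / (2 * recall - f1)
print(round(precision, 4))  # approximately 0.9341
```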

The diagonal values in this confusion matrix show how often the model classified each label correctly, and the off-diagonal values show how often it misclassified each label.

Actual label  Predicted label 1  Predicted label 0
1             93%                7%
0             8%                 92%

3. Get the weights and intercept

Get the weights and the intercept for the models:

Linear regression

SELECT
  *
FROM
  ML.WEIGHTS(MODEL `DATASET_NAME.LIN_REG_MODEL`)
weight category_weights.category
feature_1 1.8263055528635743
feature_2 1.8143804404490813
feature_3 1.8601204874033492
feature_4 1.8507603439031859
feature_5 1.7899764387123640
feature_6 1.8645246630251291
feature_7 1.8698005281925356
feature_8 1.7904637080330201
feature_9 1.8036887855406274
feature_10 1.8117115890624449
intercept -4.1428754911504306

Binary logistic regression

SELECT
  *
FROM
  ML.WEIGHTS(MODEL `DATASET_NAME.BIN_LOG_REG_MODEL`)
weight category_weights.category
feature_1 3.823533928
feature_2 3.734812819
feature_3 3.842239823
feature_4 3.785488823
feature_5 3.737386716
feature_6 3.567663961
feature_7 3.819643052
feature_8 3.734673763
feature_9 3.839301406
feature_10 3.787306994
intercept -17.922169920

4. Simulate the prediction

Linear regression

This query uses standard SQL, instead of ML.PREDICT, to make predictions by taking the dot product of each row's feature values with the model's weights and then adding the intercept. It compares predictions made with that technique to predictions made with ML.PREDICT. Note how the inner subquery that joins T.features with T.weights performs the dot product of the feature values for a row with the model's weights, and then adds the intercept.

WITH
T AS (
SELECT
  label AS actual_label,
  predicted_label AS ml_predicted_label,
  [feature_1,
  feature_2,
  feature_3,
  feature_4,
  feature_5,
  feature_6,
  feature_7,
  feature_8,
  feature_9,
  feature_10] AS features,
  [1.8263055528635743,
  1.8143804404490813,
  1.8601204874033492,
  1.8507603439031859,
  1.789976438712364,
  1.8645246630251291,
  1.8698005281925356,
  1.7904637080330201,
  1.8036887855406274,
  1.8117115890624449] AS weights
FROM
  ML.PREDICT(MODEL `DATASET_NAME.LIN_REG_MODEL`,
    (
    SELECT
      *
    FROM
      `PROJECT_NAME.DATASET_NAME.LIN_REG_TRAINING_SET`))
WHERE
  holdback),
P AS (
SELECT
  actual_label,
  ml_predicted_label,
  (
   SELECT
    SUM(element1 * element2) - 4.1428754911504306
  FROM
    T.features element1
  WITH
  OFFSET
    pos
  JOIN
    T.weights element2
  WITH
  OFFSET
    pos
  USING
    (pos) ) sql_predicted_label,
  features,
  weights
FROM
  T)
SELECT
  actual_label,
  ml_predicted_label,
  sql_predicted_label,
  ABS(ml_predicted_label - sql_predicted_label) < 0.00000000001 AS diff_is_negligible
FROM
  P
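The dot-product-plus-intercept arithmetic that the query above performs with an OFFSET join can also be sketched outside SQL. This is a hedged illustration, not part of the original example: the weights and intercept are the values from step 3, while the feature vector is made up.

```python
# Simulated linear-regression prediction: the dot product of a row's
# feature values with the model weights, plus the intercept.
weights = [1.8263055528635743, 1.8143804404490813, 1.8601204874033492,
           1.8507603439031859, 1.789976438712364, 1.8645246630251291,
           1.8698005281925356, 1.7904637080330201, 1.8036887855406274,
           1.8117115890624449]
intercept = -4.1428754911504306

def predict_linear(features):
    # Equivalent of SUM(element1 * element2) plus the intercept in the SQL.
    return sum(f * w for f, w in zip(features, weights)) + intercept

features = [0.5] * 10  # hypothetical row of feature values
print(predict_linear(features))  # roughly 5 for a row of 0.5s
```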

Binary logistic regression

For binary logistic regression, the technique for simulating predictions is very similar to linear regression, with the addition of a final step that applies the sigmoid function with the desired threshold.

This query uses standard SQL, instead of ML.PREDICT: it takes the dot product of each row's feature values with the model's weights, adds the intercept, then applies the sigmoid function with a threshold of 0.5 to the result to predict either 0 or 1. It compares predictions made with that technique to predictions made with ML.PREDICT.

WITH
T AS (
SELECT
  label AS actual_label,
  predicted_label AS ml_predicted_label,
  [feature_1,
  feature_2,
  feature_3,
  feature_4,
  feature_5,
  feature_6,
  feature_7,
  feature_8,
  feature_9,
  feature_10] AS features,
  [3.8235339279050287,
  3.7348128191185244,
  3.8422398227859471,
  3.7854888232502479,
  3.7373867156553713,
  3.5676639605351026,
  3.8196430517007811,
  3.7346737628343032,
  3.8393014063170749,
  3.7873069939244743] AS weights
FROM
  ML.PREDICT(MODEL `DATASET_NAME.BIN_LOG_REG_MODEL`,
    (
    SELECT
      *
    FROM
      `PROJECT_NAME.DATASET_NAME.BIN_LOG_REG_TRAINING_SET`))
WHERE
  holdback),
P AS (
SELECT
  actual_label,
  ml_predicted_label,
  (
   SELECT
    IF((1 / (1 + EXP(-(SUM(element1 * element2) -17.922169920432161)))) < 0.5, 0, 1)
  FROM
    T.features element1
  WITH
  OFFSET
    pos
  JOIN
    T.weights element2
  WITH
  OFFSET
    pos
  USING
    (pos) ) sql_predicted_label,
  features,
  weights
FROM
  T)
SELECT
  actual_label,
  ml_predicted_label,
  sql_predicted_label,
  ml_predicted_label = sql_predicted_label AS simulation_is_accurate
FROM
  P

In the query above, the innermost subquery computes the dot product of each row's feature values with the model's weights and then adds the intercept:

SUM(element1 * element2) - 17.922169920432161

It then applies the sigmoid function Y = 1 / (1 + e^(-z)) to the dot product plus the intercept, using standard SQL:

1 / (1 + EXP(-(SUM(element1 * element2) - 17.922169920432161)))

Finally, the result of the sigmoid function is compared with the threshold of 0.5 to arrive at the binary logistic regression prediction: 0 if the result is less than 0.5, or 1 if it is not. Note that you can use any threshold value between 0 and 1.

IF((1 / (1 + EXP(-(SUM(element1 * element2) - 17.922169920432161)))) < 0.5, 0, 1)
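For illustration, the same three stages (dot product plus intercept, sigmoid, threshold) can be sketched outside SQL. This is a hedged sketch, not part of the original example: the weights and intercept are the values from step 3, while the feature vectors are made up.

```python
import math

# The three stages of the SQL expression above, as plain Python:
# dot product plus intercept, then sigmoid, then the 0.5 threshold.
weights = [3.8235339279050287, 3.7348128191185244, 3.8422398227859471,
           3.7854888232502479, 3.7373867156553713, 3.5676639605351026,
           3.8196430517007811, 3.7346737628343032, 3.8393014063170749,
           3.7873069939244743]
intercept = -17.922169920432161

def predict_logistic(features, threshold=0.5):
    z = sum(f * w for f, w in zip(features, weights)) + intercept  # SUM(...) + intercept
    p = 1 / (1 + math.exp(-z))                                     # sigmoid
    return 0 if p < threshold else 1                               # IF(... < 0.5, 0, 1)

# Hypothetical rows: large feature values predict 1, small values predict 0.
print(predict_logistic([0.9] * 10))  # 1
print(predict_logistic([0.1] * 10))  # 0
```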

This technique can also be extended to multiclass logistic regression. In that case, the model's weights are an nxn matrix rather than a vector, and the intercept is a vector rather than a scalar. Multiply the feature-value vector by the weight matrix and add the intercept vector. The resulting vector contains a score for each label; choose the label with the highest score as your prediction. If you instead want to return an array of probabilities, apply the sigmoid function to each element of the array.
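A minimal sketch of the multiclass extension just described, using made-up weights and intercepts purely for illustration (the real values would come from ML.WEIGHTS):

```python
import math

# Multiclass extension: the weights become a matrix (one row of weights
# per class), the intercept becomes a vector, and the prediction is the
# label whose score is highest. All numbers here are made up.
weight_matrix = [
    [0.2, 0.8],  # weights for class 0
    [0.9, 0.1],  # weights for class 1
    [0.5, 0.5],  # weights for class 2
]
intercepts = [0.0, -0.1, -0.2]

def scores(features):
    # Feature vector times weight matrix, plus the intercept vector.
    return [sum(f * w for f, w in zip(features, row)) + b
            for row, b in zip(weight_matrix, intercepts)]

def predict_multiclass(features):
    s = scores(features)
    return max(range(len(s)), key=lambda i: s[i])  # label with highest score

def class_probabilities(features):
    # Apply the sigmoid element-wise if probabilities are wanted instead.
    return [1 / (1 + math.exp(-s)) for s in scores(features)]

print(predict_multiclass([1.0, 0.0]))  # class 1 scores highest here
```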

5. Compare the results

Linear regression

The sample results are nearly identical, apart from a tiny rounding error.

actual_label ml_predicted_label sql_predicted_label diff_is_negligible
6 5.2062349420751834 5.2062349420751826 true
0 0.40318472770048075 0.403184727700479 true
3 3.0703766078249597 3.0703766078249597 true
7 7.0588171538562 7.0588171538562 true
6 6.7802375930646 6.7802375930646 true
6 5.1088569571339368 5.1088569571339377 true
4 4.051839078116874 4.051839078116874 true
4 5.1810254680219243 5.1810254680219234 true
6 6.1440349466401223 6.1440349466401205 true
1 2.0842399472783519 2.0842399472783519 true
2 2.1911209811886847 2.1911209811886838 true
3 3.0236086790006622 3.0236086790006613 true
2 2.573083132964213 2.5730831329642125 true
7 5.68662973136732 5.6866297313673186 true
9 8.1860026312677938 8.1860026312677938 true

Binary logistic regression

The simulated inference results compare perfectly with the actual results from ML.PREDICT, without a single discrepancy in the 10,000-row holdback set. There are a few rows where both ML.PREDICT and the simulated inference disagree with the actual label; this is expected, since the model's accuracy is approximately 93% and the off-diagonal cells of the confusion matrix contain small but nonzero values.

actual_label ml_predicted_label sql_predicted_label simulation_is_accurate
0 1 1 true
0 0 0 true
0 0 0 true
0 0 0 true
0 0 0 true
0 0 0 true
0 0 0 true
0 0 0 true
0 0 0 true
0 0 0 true
0 1 1 true
0 0 0 true

Building an audience activation list using ML techniques

A common use case is to create a differentially private binary logistic regression model to predict conversions, then perform inference with that model when creating an audience list. Assume that the binary logistic model created in the example above models conversions, and that each row in the training and evaluation sets represents a distinct user.

The following query shows how to create an audience list containing the users that the model predicts will convert:

WITH
T AS (
SELECT
  *,
  label AS actual_label,
  [feature_1,
  feature_2,
  feature_3,
  feature_4,
  feature_5,
  feature_6,
  feature_7,
  feature_8,
  feature_9,
  feature_10] AS features,
  [3.8235339279050287,
  3.7348128191185244,
  3.8422398227859471,
  3.7854888232502479,
  3.7373867156553713,
  3.5676639605351026,
  3.8196430517007811,
  3.7346737628343032,
  3.8393014063170749,
  3.7873069939244743] AS weights
FROM
  `PROJECT_NAME.DATASET_NAME.BIN_LOG_REG_TRAINING_SET`
WHERE
  holdback),
P AS (
SELECT
  *,
  (
  SELECT
  IF
    ((1 / (1 + EXP(-(SUM(element1 * element2) -17.922169920432161)))) < 0.5, 0, 1)
  FROM
    T.features element1
  WITH
  OFFSET
    pos
  JOIN
    T.weights element2
  WITH
  OFFSET
    pos
  USING
    (pos) ) predicted_label
FROM
  T)
SELECT
  user_id
FROM
  P
WHERE
  predicted_label = 1;