URLs and Hashing
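This section contains detailed specifications of how clients check URLs.
Canonicalization of URLs
Before any URL is checked, the client is expected to canonicalize that URL.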
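To begin, we assume that the client has parsed the URL and made it valid according to RFC 2396. If the URL uses an internationalized domain name (IDN), the client should convert the URL to the ASCII Punycode representation. The URL must include a path component; that is, it must have at least one slash following the domain (http://google.com/ instead of http://google.com).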
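A minimal sketch of these two preconditions, assuming Python's built-in idna codec (which implements the older IDNA 2003 rules) is acceptable for the Punycode conversion. The helper names hostname_to_punycode and ensure_path are illustrative and do not come from the Safe Browsing documentation.

```python
from urllib.parse import urlsplit, urlunsplit


def hostname_to_punycode(hostname: str) -> str:
    """Convert an internationalized hostname to its ASCII (Punycode) form."""
    # Assumption: the stdlib "idna" codec (IDNA 2003) is good enough here.
    return hostname.encode("idna").decode("ascii")


def ensure_path(url: str) -> str:
    """Make sure the URL has a path component (at least one slash after the host)."""
    parts = urlsplit(url)
    if parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)
```

For example, ensure_path("http://google.com") yields http://google.com/, while a URL that already has a path is returned unchanged.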
First, remove tab (0x09), CR (0x0d), and LF (0x0a) characters from the URL. Do not remove escape sequences for these characters (e.g. %0a).
Second, if the URL ends in a fragment, remove the fragment. For example, shorten http://google.com/#frag to http://google.com/.
Third, repeatedly percent-unescape the URL until it has no more percent-escapes. (This may render the URL invalid.)
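The three steps above can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: it works on raw bytes so that unescaped values above 0x7F are preserved as-is, and the function name strip_and_unescape is hypothetical.

```python
import re

# Matches one percent-escape such as %0a or %2F.
_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def strip_and_unescape(url: bytes) -> bytes:
    # Step 1: remove literal tab, CR, and LF bytes.
    # Escaped forms such as %0a are left alone at this point.
    url = url.translate(None, b"\t\r\n")
    # Step 2: drop the fragment, if any.
    url = url.split(b"#", 1)[0]
    # Step 3: unescape repeatedly until a pass changes nothing.
    while True:
        decoded = _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), url)
        if decoded == url:
            return url
        url = decoded
```

Note that the loop can leave a bare % in the output (for example, %25%32%35 becomes %25 and then %), which is why the later percent-escaping step is still required.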
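To canonicalize the hostname:
Extract the hostname from the URL and then:
- Remove all leading and trailing dots.
- Replace consecutive dots with a single dot.
- If the hostname can be parsed as an IPv4 address, normalize it to 4 dot-separated decimal values. The client should handle any legal IP-address encoding, including octal, hex, and fewer than four components.
- If the hostname can be parsed as a bracketed IPv6 address, normalize it by removing unnecessary leading zeroes in the components and collapsing zero components by using the double-colon syntax. For example, [2001:0db8:0000::1] should be transformed into [2001:db8::1]. If the hostname is one of the two following special IPv6 address types, transform it into IPv4:
  - An IPv4-mapped IPv6 address, such as [::ffff:1.2.3.4], which should be transformed into 1.2.3.4;
  - A NAT64 address using the well-known prefix 64:ff9b::/96 (RFC 6052, section 2.1), such as [64:ff9b::1.2.3.4], which should be transformed into 1.2.3.4.
- Lowercase the whole string.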
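The hostname rules above can be sketched with Python's standard ipaddress module. The function names are hypothetical, and the hand-written IPv4 parser reflects one reading of "any legal IP-address encoding" (decimal, leading-0 octal, 0x hex, and one to four components); treat it as an assumption, not as the reference behavior.

```python
import ipaddress
import re
from typing import Optional

_NAT64 = ipaddress.ip_network("64:ff9b::/96")


def _component(part: str) -> Optional[int]:
    """Parse one IPv4 component written in hex, octal, or decimal."""
    if re.fullmatch(r"0[xX][0-9a-fA-F]+", part):
        return int(part[2:], 16)
    if re.fullmatch(r"0[0-7]*", part):
        return int(part, 8)
    if re.fullmatch(r"[1-9][0-9]*", part):
        return int(part)
    return None


def _parse_ipv4(host: str) -> Optional[str]:
    parts = host.split(".")
    if not 1 <= len(parts) <= 4:
        return None
    values = [_component(p) for p in parts]
    if any(v is None for v in values):
        return None
    *leading, last = values
    # Leading components are single bytes; the last one fills the remaining bytes.
    if any(v > 255 for v in leading) or last >= 256 ** (4 - len(leading)):
        return None
    number = last
    for i, v in enumerate(leading):
        number += v << (8 * (3 - i))
    return str(ipaddress.IPv4Address(number))


def canonicalize_host(host: str) -> str:
    host = host.strip(".")               # remove leading and trailing dots
    host = re.sub(r"\.{2,}", ".", host)  # collapse consecutive dots
    ipv4 = _parse_ipv4(host)
    if ipv4 is not None:
        return ipv4
    if host.startswith("[") and host.endswith("]"):
        try:
            addr = ipaddress.IPv6Address(host[1:-1])
        except ValueError:
            return host.lower()
        if addr.ipv4_mapped is not None:  # [::ffff:a.b.c.d]
            return str(addr.ipv4_mapped)
        if addr in _NAT64:                # [64:ff9b::a.b.c.d]
            return str(ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF))
        return f"[{addr.compressed}]"
    return host.lower()
```

For instance, canonicalize_host("0x7f.1") gives 127.0.0.1, because the final component of a two-part address fills the remaining three bytes.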
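To canonicalize the path:
- Resolve the sequences /../ and /./ in the path by replacing /./ with /, and removing /../ along with the preceding path component.
- Replace runs of consecutive slashes with a single slash character.
Do not apply these path canonicalizations to the query parameters.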
In the URL, percent-escape all characters that are <= ASCII 32, >= 127, #, or %. The escapes should use uppercase hex characters.
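A rough sketch of the path rules and the final escaping step follows. The names canonicalize_path and percent_escape are hypothetical. Two details are assumptions that the text above does not spell out: a trailing "." or ".." segment is resolved like /./ and /../, and empty segments produced by repeated slashes are not counted as path components.

```python
from typing import List


def canonicalize_path(path: str) -> str:
    """Resolve dot segments and collapse repeated slashes (path only, no query)."""
    parts = path.split("/")
    segments: List[str] = []
    for part in parts:
        if part in ("", "."):
            continue              # empty parts come from repeated slashes
        if part == "..":
            if segments:
                segments.pop()    # drop the preceding path component
            continue
        segments.append(part)
    result = "/" + "/".join(segments)
    # Keep a trailing slash when the input ended in "/", "/.", or "/..".
    if segments and parts[-1] in ("", ".", ".."):
        result += "/"
    return result


def percent_escape(data: bytes) -> str:
    """Escape bytes <= 32, >= 127, '#', and '%' using uppercase hex."""
    out = []
    for byte in data:
        if byte <= 32 or byte >= 127 or byte in b"#%":
            out.append(f"%{byte:02X}")
        else:
            out.append(chr(byte))
    return "".join(out)
```

canonicalize_path is meant to be called on the path component alone, so the query string is left untouched; percent_escape takes bytes so that it can be applied directly to the output of the unescaping step.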
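Host-Suffix Path-Prefix Expressions
Once the URL is canonicalized, the next step is to create the suffix/prefix expressions. Each suffix/prefix expression consists of a host suffix (or full host) and a path prefix (or full path).
The client will form up to 30 different possible host suffix and path prefix combinations. These combinations use only the host and path components of the URL. The scheme, username, password, and port are discarded. If the URL includes query parameters, then at least one combination will include the full path and query parameters.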
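For the host, the client will try at most five different strings. They are:
- If the hostname is not an IPv4 or IPv6 literal, up to four hostnames formed by starting with the eTLD+1 domain and adding successive leading components. The determination of eTLD+1 should be based on the Public Suffix List (https://publicsuffix.org/). For example, a.b.example.com would result in the eTLD+1 domain of example.com as well as the host with one additional host component, b.example.com.
- The exact hostname in the URL. Following the previous example, a.b.example.com would be checked.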
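For the path, the client will try at most six different strings. They are:
- The exact path of the URL, including query parameters.
- The exact path of the URL, without query parameters.
- The four paths formed by starting at the root (/) and successively appending path components, including a trailing slash.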
The following examples illustrate the check behavior:
For the URL http://a.b.com/1/2.html?param=1, the client will try these possible strings:
a.b.com/1/2.html?param=1
a.b.com/1/2.html
a.b.com/
a.b.com/1/
b.com/1/2.html?param=1
b.com/1/2.html
b.com/
b.com/1/
For the URL http://a.b.c.d.e.f.com/1.html, the client will try these possible strings:
a.b.c.d.e.f.com/1.html
a.b.c.d.e.f.com/
c.d.e.f.com/1.html
c.d.e.f.com/
d.e.f.com/1.html
d.e.f.com/
e.f.com/1.html
e.f.com/
f.com/1.html
f.com/
(Note: skip b.c.d.e.f.com, since we'll take only the last five hostname components, and the full hostname.)
For the URL http://1.2.3.4/1/, the client will try these possible strings:
1.2.3.4/1/
1.2.3.4/
For the URL http://example.co.uk/1, the client will try these possible strings:
example.co.uk/1
example.co.uk/
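The expression rules and the examples above can be reproduced with a short sketch. All function names here are hypothetical. Because looking up the Public Suffix List is outside the scope of this sketch, the caller is assumed to supply the eTLD+1 domain, and None is used to signal an IPv4 or IPv6 literal.

```python
from typing import List, Optional


def host_candidates(host: str, etld_plus_one: Optional[str]) -> List[str]:
    """Exact host plus up to four suffixes built up from the eTLD+1."""
    if etld_plus_one is None:  # IP literal: only the exact host is used
        return [host]
    if host != etld_plus_one and not host.endswith("." + etld_plus_one):
        raise ValueError("host does not end with the given eTLD+1")
    extra = host[: len(host) - len(etld_plus_one)].rstrip(".")
    labels = extra.split(".") if extra else []
    suffixes = [etld_plus_one]
    for label in reversed(labels):  # add leading components one at a time
        suffixes.append(label + "." + suffixes[-1])
    candidates = [host]
    for suffix in reversed(suffixes[:4]):
        if suffix not in candidates:
            candidates.append(suffix)
    return candidates


def path_candidates(path: str, query: str = "") -> List[str]:
    """Exact path with and without query, plus up to four directory prefixes."""
    candidates = []
    if query:
        candidates.append(f"{path}?{query}")
    candidates.append(path)
    prefixes = ["/"]
    for component in path.split("/")[1:-1]:  # directories only, not the last segment
        prefixes.append(prefixes[-1] + component + "/")
    for prefix in prefixes[:4]:
        if prefix not in candidates:
            candidates.append(prefix)
    return candidates


def expressions(host: str, path: str, query: str = "",
                etld_plus_one: Optional[str] = None) -> List[str]:
    """Combine every host candidate with every path candidate (at most 5 x 6)."""
    paths = path_candidates(path, query)
    return [h + p for h in host_candidates(host, etld_plus_one) for p in paths]
```

Calling expressions("a.b.com", "/1/2.html", "param=1", "b.com") returns the eight strings listed in the first example, in the same order.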
Hashing
Google Safe Browsing exclusively uses SHA256 as the hash function. This hash function should be applied to the above expressions.
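The full 32-byte hash will, depending on the circumstances, be truncated to 4 bytes, 8 bytes, or 16 bytes:
- When using the hashes.search method, the hashes in the request must currently be truncated to exactly 4 bytes. Sending additional bytes in this request will compromise user privacy.
- When downloading the lists for the local database using the hashList.get method or the hashLists.batchGet method, the length of the hashes sent by the server is encoded in the naming convention of the lists, whose names carry a suffix indicating the hash length. See the Available Lists section for more details.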
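Hashing and truncation can be sketched with Python's hashlib. The function names are hypothetical, and encoding the expression as UTF-8 is an assumption; a fully canonicalized expression should already be plain ASCII, for which the two encodings give identical bytes.

```python
import hashlib


def full_hash(expression: str) -> bytes:
    """Return the full 32-byte SHA256 digest of a host-suffix/path-prefix expression."""
    return hashlib.sha256(expression.encode("utf-8")).digest()


def hash_prefix(expression: str, length: int = 4) -> bytes:
    """Return the first `length` bytes of the full hash (4, 8, 16, or all 32)."""
    if length not in (4, 8, 16, 32):
        raise ValueError("length must be 4, 8, 16, or 32 bytes")
    return full_hash(expression)[:length]
```

The default of 4 bytes mirrors the prefix length described above for search requests; the other lengths apply only where the text above says they do.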
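Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-07-25 UTC.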