URLs and Hashing
This section specifies in detail how clients check URLs.
Canonicalization of URLs
Before checking any URL, the client is expected to canonicalize it.
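To begin, we assume that the client has parsed the URL and made it valid according to RFC 2396. If the URL uses an internationalized domain name (IDN), the client should convert the URL to the ASCII Punycode representation. The URL must include a path component; that is, it must have at least one slash following the domain (http://google.com/ instead of http://google.com).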
First, remove tab (0x09), CR (0x0d), and LF (0x0a) characters from the URL. Do not remove escape sequences for these characters (e.g. %0a).
Second, if the URL ends in a fragment, remove the fragment. For example, shorten http://google.com/#frag to http://google.com/.
Third, repeatedly percent-unescape the URL until it has no more percent-escapes. (This may render the URL invalid.)
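The three steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name strip_and_unescape is hypothetical, the input is assumed to be the raw URL as bytes, and "no more percent-escapes" is interpreted as "unescaping no longer changes the string", so malformed sequences such as a lone % are left in place.

```python
from urllib.parse import unquote_to_bytes


def strip_and_unescape(url: bytes) -> bytes:
    """Hypothetical helper covering the three preliminary steps."""
    # Step 1: drop raw tab, CR and LF bytes. Escaped forms such as %0a
    # are untouched here because they are still literal '%', '0', 'a'.
    url = url.translate(None, b"\t\r\n")

    # Step 2: remove the fragment (everything from the first raw '#').
    url = url.split(b"#", 1)[0]

    # Step 3: unescape repeatedly until a fixed point is reached.
    while True:
        unescaped = unquote_to_bytes(url)
        if unescaped == url:
            return url
        url = unescaped
```

Working on bytes rather than str avoids guessing a character encoding for unescaped values such as %80, which is one reason the result may no longer be a valid URL.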
To canonicalize the hostname:
Extract the hostname from the URL and then:
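- Remove all leading and trailing dots.
- Replace consecutive dots with a single dot.
- If the hostname can be parsed as an IPv4 address, normalize it to 4 dot-separated decimal values. The client should handle any legal IP-address encoding, including octal, hex, and fewer than four components.
- If the hostname can be parsed as a bracketed IPv6 address, normalize it by removing unnecessary leading zeroes in the components and collapsing zero components by using the double-colon syntax. For example, [2001:0db8:0000::1] should be transformed into [2001:db8::1]. If the hostname is one of the following two special IPv6 address types, transform it into IPv4:
  - An IPv4-mapped IPv6 address, such as [::ffff:1.2.3.4], which should be transformed into 1.2.3.4;
  - A NAT64 address using the well-known prefix 64:ff9b::/96, such as [64:ff9b::1.2.3.4], which should be transformed into 1.2.3.4.
- Lowercase the whole string.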
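The hostname rules above can be sketched with Python's standard ipaddress module. This is an illustrative sketch under stated assumptions: canonicalize_host and _parse_ipv4 are hypothetical names, and "any legal IP-address encoding" is interpreted in the classic inet_aton style (a 0x prefix means hex, a leading 0 means octal, and the last component fills all remaining bytes).

```python
import ipaddress
import re

NAT64_PREFIX = ipaddress.ip_network("64:ff9b::/96")
_HEX = re.compile(r"0[xX][0-9a-fA-F]+")
_DIGITS = re.compile(r"[0-9]+")


def _parse_ipv4(host):
    """Return the 32-bit value of an inet_aton-style IPv4 string, else None."""
    parts = host.split(".")
    if not 1 <= len(parts) <= 4:
        return None
    values = []
    for part in parts:
        if _HEX.fullmatch(part):
            values.append(int(part, 16))
        elif _DIGITS.fullmatch(part):
            is_octal = len(part) > 1 and part[0] == "0"
            try:
                values.append(int(part, 8 if is_octal else 10))
            except ValueError:  # e.g. "08" is not valid octal
                return None
        else:
            return None
    *leading, last = values
    # Leading components are single bytes; the last one fills the rest.
    if any(v > 255 for v in leading) or last >= 256 ** (4 - len(leading)):
        return None
    value = last
    for i, v in enumerate(leading):
        value += v << (8 * (3 - i))
    return value


def canonicalize_host(host: str) -> str:
    # Remove leading/trailing dots, then collapse runs of dots.
    host = re.sub(r"\.{2,}", ".", host.strip("."))
    ipv4 = _parse_ipv4(host)
    if ipv4 is not None:
        return str(ipaddress.IPv4Address(ipv4))
    if host.startswith("[") and host.endswith("]"):
        try:
            addr = ipaddress.IPv6Address(host[1:-1])
        except ValueError:
            return host.lower()
        if addr.ipv4_mapped is not None:  # [::ffff:a.b.c.d]
            return str(addr.ipv4_mapped)
        if addr in NAT64_PREFIX:  # [64:ff9b::a.b.c.d]
            return str(ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF))
        return "[" + addr.compressed + "]"
    return host.lower()
```

For instance, canonicalize_host("0x7f.1") yields 127.0.0.1 because the final component 1 is spread over the three remaining bytes.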
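To canonicalize the path:
- Resolve the sequences /../ and /./ in the path by replacing /./ with /, and removing /../ along with the preceding path component.
- Replace runs of consecutive slashes with a single slash character.
Do not apply these path canonicalizations to the query parameters.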
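A minimal Python sketch of the path rules follows. The function names are hypothetical, and the sketch makes one simplifying assumption that the text does not spell out: empty segments are dropped while walking the path, so consecutive slashes are collapsed at the same time as the dot segments are resolved. The query string is split off first and passed through unchanged.

```python
def canonicalize_path(path: str) -> str:
    """Resolve '.' and '..' segments and collapse repeated slashes."""
    parts = path.split("/")
    kept = []
    for part in parts:
        if part in ("", "."):
            continue  # empty segment (extra slash) or current directory
        if part == "..":
            if kept:
                kept.pop()  # drop the preceding path component
            continue
        kept.append(part)
    result = "/" + "/".join(kept)
    # Keep a trailing slash when the input ended in a directory reference.
    if kept and parts[-1] in ("", ".", ".."):
        result += "/"
    return result


def canonicalize_path_and_query(path_and_query: str) -> str:
    """Canonicalize only the path; leave the query parameters untouched."""
    path, sep, query = path_and_query.partition("?")
    return canonicalize_path(path) + sep + query
```

An empty path comes back as /, which matches the earlier requirement that the URL always carries a path component.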
In the URL, percent-escape all characters that are <= ASCII 32, >= 127, #, or %. The escapes should use uppercase hex characters.
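The escaping rule can be sketched as a single pass over the bytes of the URL. The helper name escape_url is hypothetical, and the input is assumed to be the byte string left after the earlier unescaping step.

```python
def escape_url(data: bytes) -> str:
    """Percent-escape bytes <= 32, >= 127, '#' and '%' with uppercase hex."""
    out = []
    for byte in data:
        if byte <= 32 or byte >= 127 or byte in b"#%":
            out.append(f"%{byte:02X}")
        else:
            out.append(chr(byte))
    return "".join(out)
```

Because every remaining % is escaped here, the result contains exactly one level of percent-escaping regardless of how many levels the input had.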
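Host-Suffix Path-Prefix Expressions
Once the URL is canonicalized, the next step is to create the suffix/prefix expressions. Each suffix/prefix expression consists of a host suffix (or full host) and a path prefix (or full path).
The client will form up to 30 different possible host suffix and path prefix combinations. These combinations use only the host and path components of the URL. The scheme, username, password, and port are discarded. If the URL includes query parameters, then at least one combination will include the full path and query parameters.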
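For the host, the client will try at most five different strings. They are:
- If the hostname is not an IPv4 or IPv6 literal, up to four hostnames formed by starting with the eTLD+1 domain and adding successive leading components. The determination of eTLD+1 should be based on the Public Suffix List. For example, a.b.example.com would result in the eTLD+1 domain of example.com as well as the host with one additional host component, b.example.com.
- The exact hostname in the URL. Following the previous example, a.b.example.com would be checked.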
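For the path, the client will try at most six different strings. They are:
- The exact path of the URL, including query parameters.
- The exact path of the URL, without query parameters.
- The four paths formed by starting at the root (/) and successively appending path components, including a trailing slash.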
The following examples illustrate the check behavior:
For the URL http://a.b.com/1/2.html?param=1, the client will try these possible strings:
a.b.com/1/2.html?param=1
a.b.com/1/2.html
a.b.com/
a.b.com/1/
b.com/1/2.html?param=1
b.com/1/2.html
b.com/
b.com/1/
For the URL http://a.b.c.d.e.f.com/1.html, the client will try these possible strings:
a.b.c.d.e.f.com/1.html
a.b.c.d.e.f.com/
c.d.e.f.com/1.html
c.d.e.f.com/
d.e.f.com/1.html
d.e.f.com/
e.f.com/1.html
e.f.com/
f.com/1.html
f.com/
(Note: skip b.c.d.e.f.com, since we'll take only the last five hostname components, and the full hostname.)
For the URL http://1.2.3.4/1/, the client will try these possible strings:
1.2.3.4/1/
1.2.3.4/
For the URL http://example.co.uk/1, the client will try these possible strings:
example.co.uk/1
example.co.uk/
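The expression-building rules can be sketched in Python as below. All function names are hypothetical. The sketch assumes the host and path are already canonicalized, and it takes the eTLD+1 as a parameter (None for IP literals) rather than looking it up, since a real client would derive it from the Public Suffix List. The output order is chosen to mirror the examples above.

```python
def host_candidates(host, etld_plus_one=None):
    """Exact host, then up to four suffixes starting from the eTLD+1."""
    candidates = [host]
    if etld_plus_one is None:  # IPv4/IPv6 literal: exact host only
        return candidates
    labels = host.split(".")
    base = len(etld_plus_one.split("."))
    # Longest suffix first; never repeat the exact host itself.
    for count in range(min(base + 3, len(labels) - 1), base - 1, -1):
        candidates.append(".".join(labels[-count:]))
    return candidates


def path_candidates(path, query=""):
    """Exact path (with and without query), then up to four prefixes."""
    candidates = []
    if query:
        candidates.append(path + "?" + query)
    candidates.append(path)
    prefixes = ["/"]
    for segment in path.split("/")[1:-1]:  # directory components only
        prefixes.append(prefixes[-1] + segment + "/")
    for prefix in prefixes[:4]:
        if prefix not in candidates:
            candidates.append(prefix)
    return candidates


def expressions(host, path, query="", etld_plus_one=None):
    """All host-suffix/path-prefix combinations (at most 5 x 6 = 30)."""
    return [h + p
            for h in host_candidates(host, etld_plus_one)
            for p in path_candidates(path, query)]
```

Calling expressions("a.b.com", "/1/2.html", "param=1", etld_plus_one="b.com") reproduces the eight strings of the first example in the same order.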
Hashing
Google Safe Browsing exclusively uses SHA256 as the hash function. This hash function should be applied to the above expressions.
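As a small illustration, hashing one expression with Python's standard hashlib might look like this. The function names are hypothetical, and the sketch assumes that a shortened hash is formed by keeping the leading bytes of the digest and that the expression is pure ASCII after canonicalization.

```python
import hashlib


def full_hash(expression: str) -> bytes:
    """Return the full 32-byte SHA256 digest of one expression."""
    # Canonicalized expressions contain only ASCII characters.
    return hashlib.sha256(expression.encode("ascii")).digest()


def hash_prefix(expression: str, length: int = 4) -> bytes:
    """Return the first `length` bytes of the SHA256 digest."""
    return full_hash(expression)[:length]
```

A client would call hash_prefix once per expression, for example hash_prefix("a.b.com/1/2.html?param=1").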
The full 32-byte hash will, depending on the circumstances, be truncated to 4 bytes, 8 bytes, or 16 bytes:
- When using the hashes.search method, the hashes in the request are currently required to be truncated to exactly 4 bytes. Sending additional bytes in this request will compromise user privacy.
- When downloading the lists for the local database using the hashList.get method or the hashLists.batchGet method, the length of the hashes sent by the server is encoded in the naming convention of the lists, whose names carry a suffix indicating the hash length. See the Available Lists section for more details.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-07-25 UTC.