Embeddings: Interactive exercises
The following widget, based on TensorFlow's Embedding Projector, flattens 10,000 `word2vec` static vectors into a 3D space. This collapse of dimensions can be misleading, because the points closest to each other in the original high-dimensional space may appear farther apart in the 3D projection. The closest *n* points are highlighted in purple, with *n* chosen by the user in **Isolate __ points**. The sidebar on the right identifies those nearest neighbors.

In these experiments, you'll work with the `word2vec` embeddings in the widget above.
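The 3D view comes from projecting the high-dimensional vectors down to three coordinates. As a minimal sketch of one common projection the widget offers (principal component analysis, here computed via SVD in NumPy), where the random vectors are stand-ins for real `word2vec` embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for word2vec vectors: 1,000 vectors of dimension 300.
vectors = rng.normal(size=(1000, 300))

# Center the data, then take the top three principal components via SVD.
centered = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected_3d = centered @ vt[:3].T  # each 300-d vector becomes a 3-d point

print(projected_3d.shape)  # → (1000, 3)
```

Distances between the 3-d points only approximate distances in the original 300-d space, which is why nearby-looking points in the widget are not always true nearest neighbors.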
Task 1
Try to find the 20 nearest neighbors for the following, and see where the groups fall in the cloud.

- `iii`, `third`, and `three`
- `tao` and `way`
- `orange`, `yellow`, and `juice`

What do you notice about these results?
Click here for our answer
Even though `iii`, `third`, and `three` are semantically similar, they appear in different contexts in text and don't appear to be close together in this embedding space. In `word2vec`, `iii` is closer to `iv` than to `third`.

Similarly, while `way` is a direct translation of `tao`, these words most frequently occur with completely different groups of words in the dataset used, so the two vectors are very far apart.

The first several nearest neighbors of `orange` are colors, but `juice` and `peel`, related to the fruit sense of `orange`, show up as the 14th and 18th nearest neighbors. `prince`, meanwhile, as in the Prince of Orange, is 17th. In the projection, the words closest to `orange` are `yellow` and other colors, while the closest words to `juice` don't include `orange`.
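Nearest-neighbor rankings like these are typically computed with cosine similarity between embedding vectors. A minimal sketch in NumPy; the tiny vocabulary and 4-dimensional vectors are invented for illustration (real `word2vec` vectors have hundreds of dimensions):

```python
import numpy as np

# Toy embedding table; the values are made up for illustration.
embeddings = {
    "orange": np.array([0.9, 0.1, 0.0, 0.2]),
    "yellow": np.array([0.8, 0.2, 0.1, 0.1]),
    "juice":  np.array([0.4, 0.9, 0.0, 0.3]),
    "third":  np.array([0.0, 0.1, 0.9, 0.8]),
}

def nearest_neighbors(query, k=2):
    """Rank all other words by cosine similarity to `query`."""
    q = embeddings[query]
    sims = {
        w: float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
        for w, v in embeddings.items() if w != query
    }
    return sorted(sims, key=sims.get, reverse=True)[:k]

print(nearest_neighbors("orange"))  # → ['yellow', 'juice']
```

In this toy table, `orange` sits nearer the color `yellow` than the fruit-related `juice`, mirroring what the widget shows for the real vectors.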
Task 2
Try to figure out some characteristics of the training data. For example, try to find the 100 nearest neighbors for the following, and see where the groups fall in the cloud:

- `boston`, `paris`, `tokyo`, `delhi`, `moscow`, and `seoul` (this is a trick question)
- `jane`, `sarah`, `john`, `peter`, `rosa`, and `juan`
Click here for our answer
Many of the nearest neighbors to `boston` are other cities in the US. Many of the nearest neighbors to `paris` are other cities in Europe. `tokyo` and `delhi` don't seem to have similar results: one is associated with travel-hub cities around the world, while the other is associated with `india` and related words. `seoul` doesn't appear in this trimmed-down set of word vectors at all.

It seems that this dataset contains many documents related to US national geography, some documents related to European regional geography, and not much fine-grained coverage of other countries or regions.
Similarly, this dataset seems to contain many male English names, some female English names, and far fewer names from other languages. Note that Don Rosa wrote and illustrated Scrooge McDuck comics for Disney, which is the likely reason that `scrooge` and `mcduck` are among the nearest neighbors for `rosa`.
The pre-trained word vectors offered by `word2vec` were in fact trained on Google News articles up to 2013.
Task 3
Embeddings aren't limited to words. Images, audio, and other data can also be embedded. For this task:
1. Open TensorFlow's Embedding Projector.
2. In the left sidebar titled **Data**, choose **Mnist with images**. This brings up a projection of the embeddings of the MNIST database of handwritten digits.
3. Click to stop the rotation and choose a single image. Zoom in and out as needed.
4. Look in the right sidebar for nearest neighbors. Are there any surprises?

- Why do some `7`s have `1`s as their nearest neighbors? Why do some `8`s have `9`s as their nearest neighbors?
- Is there anything about the images on the edges of the projection space that seems different from the images in the center of the projection space?
Keep in mind that the model that created these embeddings receives image data, that is, pixels, and chooses a numerical vector representation for each image. The model doesn't make an automatic association between the image of a handwritten digit and the numerical digit itself.
Click here for our answer
Due to similarities in shape, the vector representations of some of the skinnier, narrower `7`s are placed closer to the vectors for handwritten `1`s. The same thing happens for some `8`s and `9`s, and even some `5`s and `3`s.

The handwritten digits on the outside of the projection space appear more clearly identifiable as one particular digit and are strongly differentiated from the other possible digits.
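The "shapes that look alike end up close together" effect can be illustrated with a toy nearest-neighbor search over pixel vectors. This sketch uses invented 3×3 binary "digits" and raw pixels rather than learned embeddings (real MNIST images are 28×28 grayscale, and the projector uses learned vectors), but the mechanism is the same:

```python
import numpy as np

# Tiny invented 3x3 "images"; names and pixel patterns are hypothetical.
images = {
    "one":      np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "skinny_7": np.array([[1, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "round_8":  np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]]),
}

def nearest_image(name):
    """Flatten each image to a vector and compare by Euclidean distance."""
    q = images[name].ravel()
    dists = {
        other: float(np.linalg.norm(images[other].ravel() - q))
        for other in images if other != name
    }
    return min(dists, key=dists.get)

print(nearest_image("skinny_7"))  # → one
```

The narrow 7 differs from the 1 by a single pixel, so its vector lands closer to the 1 than to the 8, just as skinny handwritten 7s land near 1s in the projector.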
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-05-15 (UTC).