Embeddings: Interactive exercises
The following widget, based on TensorFlow's [Embedding Projector](https://projector.tensorflow.org/), flattens 10,000 `word2vec` static vectors into a 3D space. This collapse of dimensions can be misleading, because the points closest to each other in the original high-dimensional space may appear farther apart in the 3D projection. The closest *n* points are highlighted in purple, with *n* chosen by the user in **Isolate __ points**. The sidebar on the right identifies those nearest neighbors.
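The projector's exact pipeline isn't reproduced here, but a minimal sketch of the same idea, assuming PCA as the flattening step and randomly generated stand-in vectors, shows why neighborhoods can shift when 300 dimensions collapse to three:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 10,000 random 300-dimensional vectors
# (the same shape as the word2vec vectors in the widget).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 300))

# Flatten to 3D, as the Embedding Projector does with PCA.
projected = PCA(n_components=3).fit_transform(vectors)

def nearest(points, idx, n):
    """Indices of the n nearest neighbors of points[idx] by Euclidean distance."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    return np.argsort(dists)[1 : n + 1]  # skip the query point itself

# Neighborhoods in the original 300-D space and in the 3D projection
# often disagree, which is why nearby points can look far apart in 3D.
overlap = set(nearest(vectors, 0, 20)) & set(nearest(projected, 0, 20))
print(f"Neighbors shared between 300-D and 3D: {len(overlap)} of 20")
```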
In these experiments, you'll play with the `word2vec` embeddings in the widget above.
Task 1
Try to find the 20 nearest neighbors for the following, and see where the groups fall in the cloud:

- `iii`, `third`, and `three`
- `tao` and `way`
- `orange`, `yellow`, and `juice`
What do you notice about these results?
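To reproduce this kind of search outside the widget, you can rank a vocabulary by cosine similarity. The sketch below assumes a toy `vectors` dict filled with random stand-in vectors (hypothetical, not the real word2vec data), so its output only illustrates the ranking mechanism:

```python
import numpy as np

def nearest_neighbors(query, vectors, n=20):
    """Return the n words whose vectors are most cosine-similar to vectors[query]."""
    q = vectors[query] / np.linalg.norm(vectors[query])
    scores = {
        word: float(vec @ q) / float(np.linalg.norm(vec))
        for word, vec in vectors.items()
        if word != query
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical toy vocabulary; real word2vec vectors are 300-dimensional
# and learned from text, not random.
rng = np.random.default_rng(0)
words = ["iii", "third", "three", "iv", "tao", "way", "orange", "yellow", "juice"]
vectors = {w: rng.normal(size=300) for w in words}

print(nearest_neighbors("orange", vectors, n=3))
```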
**Click here for our answer**
Even though `iii`, `third`, and `three` are semantically similar, they appear in different contexts in text and aren't close together in this embedding space. In `word2vec`, `iii` is closer to `iv` than to `third`.
Similarly, while `way` is a direct translation of `tao`, these words most frequently occur with completely different groups of words in the dataset used, so the two vectors are very far apart.
The first several nearest neighbors of `orange` are colors, but `juice` and `peel`, related to the fruit sense of `orange`, show up as the 14th and 18th nearest neighbors. `prince`, meanwhile, as in the Prince of Orange, is 17th. In the projection, the words closest to `orange` are `yellow` and other colors, while the closest words to `juice` don't include `orange`.
Task 2
Try to figure out some characteristics of the training data. For example, try to find the 100 nearest neighbors for the following, and see where the groups fall in the cloud:

- `boston`, `paris`, `tokyo`, `delhi`, `moscow`, and `seoul` (this is a trick question)
- `jane`, `sarah`, `john`, `peter`, `rosa`, and `juan`
**Click here for our answer**
Many of the nearest neighbors to `boston` are other cities in the US, and many of the nearest neighbors to `paris` are other cities in Europe. `tokyo` and `delhi` don't seem to have similar results: one is associated with cities around the world that are travel hubs, while the other is associated with `india` and related words. `seoul` doesn't appear in this trimmed-down set of word vectors at all.
It seems that this dataset contains many documents related to US national geography, some related to European regional geography, and little fine-grained coverage of other countries or regions.
Similarly, this dataset seems to contain many male English names, some female English names, and far fewer names from other languages. Note that Don Rosa wrote and illustrated Scrooge McDuck comics for Disney, which is likely why `scrooge` and `mcduck` are among the nearest neighbors for `rosa`.
The pre-trained word vectors offered by `word2vec` were in fact trained on [Google News articles up to 2013](https://code.google.com/archive/p/word2vec/).
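If you want to query those vectors locally rather than in the widget, one common route (an assumption here, not something this exercise requires) is the `gensim` library with the published `GoogleNews-vectors-negative300.bin` file:

```python
from gensim.models import KeyedVectors

# Assumes you've downloaded and unzipped the archive linked above.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# 100 nearest neighbors by cosine similarity, as in Task 2.
# Note: the pretrained file is case-sensitive, so check a token's
# casing before querying it.
if "boston" in kv.key_to_index:
    for word, score in kv.most_similar("boston", topn=100):
        print(f"{word}\t{score:.3f}")

# Out-of-vocabulary words simply aren't there, much as `seoul` is
# missing from the widget's trimmed-down vector set.
print("seoul" in kv.key_to_index)
```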
Task 3
Embeddings aren't limited to words; images, audio, and other data can also be embedded. For this task, follow these steps:

1. Open TensorFlow's [Embedding Projector](https://projector.tensorflow.org/).
2. In the left sidebar titled **Data**, choose **Mnist with images**. This brings up a projection of the embeddings of the [MNIST](https://developers.google.com/machine-learning/glossary#mnist) database of handwritten digits.
3. Click to stop the rotation and choose a single image. Zoom in and out as needed.
4. Look in the right sidebar for nearest neighbors. Are there any surprises?

- Why do some `7`s have `1`s as their nearest neighbors? Why do some `8`s have `9`s as their nearest neighbors?
- Is there anything about the images on the edges of the projection space that seems different from the images in the center?
Keep in mind that the model that created these embeddings receives image data, which is to say, pixels, and chooses a numerical vector representation for each image. The model doesn't make an automatic association between the image of a handwritten digit and the numerical digit itself.
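The exact model behind the projector's MNIST demo isn't specified here, but a sketch of how a model can produce such pixel-only embeddings, assuming a small Keras autoencoder (a stand-in, not the projector's actual model), looks like this:

```python
import tensorflow as tf

# Load MNIST and flatten each 28x28 image into a 784-dimensional pixel vector.
(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# A tiny autoencoder: the 32-dimensional bottleneck is the embedding.
# Labels are never used, so any structure comes from pixel patterns alone.
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(784, activation="sigmoid"),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_train, x_train, epochs=5, batch_size=256)

# Embed the images; nearby vectors correspond to similar-looking digits
# (for example, a skinny 7 near a 1), regardless of their labels.
embeddings = encoder.predict(x_train)
```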
**Click here for our answer**
Due to similarities in shape, the vector representations of some of the skinnier, narrower `7`s are placed closer to the vectors for handwritten `1`s. The same thing happens for some `8`s and `9`s, and even some of the `5`s and `3`s.
The handwritten digits on the outside of the projection space are more clearly identifiable as one of the ten digits and strongly differentiated from other possible digits.